A Survey of Methods for Scaling Up Inductive Algorithms

Published in: Data Mining and Knowledge Discovery

Abstract

One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.
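To ground the abstract's topic, here is a minimal sketch of one family of scaling techniques the survey categorizes: progressive sampling, in which a learner is trained on geometrically growing subsamples and training stops once held-out accuracy plateaus, often well before the full database is consumed. The stump learner, the synthetic data, and all parameter names (`n0`, `mult`, `epsilon`) are illustrative stand-ins, not the survey's own algorithms.

```python
import random

random.seed(0)

def make_data(n):
    # Synthetic binary task: label depends on feature x, with 10% label noise.
    data = []
    for _ in range(n):
        x = random.random()
        y = (x > 0.5) if random.random() > 0.1 else (x <= 0.5)
        data.append((x, int(y)))
    return data

def train_stump(sample):
    # One-level decision tree: pick the threshold (over a small grid)
    # that minimizes training error.
    best_t, best_err = 0.5, float("inf")
    for t in (i / 20 for i in range(1, 20)):
        err = sum((x > t) != bool(y) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def accuracy(t, data):
    return sum((x > t) == bool(y) for x, y in data) / len(data)

def progressive_sample(train, test, n0=100, mult=2, epsilon=0.005):
    # Grow the training sample geometrically; stop when held-out accuracy
    # improves by no more than epsilon (or the data is exhausted).
    n, prev_acc = n0, 0.0
    while True:
        model = train_stump(train[:n])
        acc = accuracy(model, test)
        if acc - prev_acc <= epsilon or n >= len(train):
            return model, n, acc
        prev_acc, n = acc, min(n * mult, len(train))

train, test = make_data(20000), make_data(2000)
model, n_used, acc = progressive_sample(train, test)
print(n_used, round(acc, 3))  # typically stops far below the full 20,000 examples
```

The design choice illustrated here is the one the survey's taxonomy emphasizes: trading a small, measurable loss in accuracy for a large reduction in the data actually processed.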




Cite this article

Provost, F., Kolluri, V. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery 3, 131–169 (1999). https://doi.org/10.1023/A:1009876119989
