Abstract
In this paper, we address the ambitious task of formulating a general framework for data mining. We discuss the requirements that such a framework should fulfill: It should elegantly handle different types of data, different data mining tasks, and different types of patterns/models. We also discuss data mining languages and what they should support: this includes the design and implementation of data mining algorithms, as well as their composition into nontrivial multi-step knowledge discovery scenarios relevant for practical application. We proceed by laying out some basic concepts, starting with (structured) data and generalizations (e.g., patterns and models) and continuing with data mining tasks and basic components of data mining algorithms (i.e., refinement operators, distances, features and kernels). We next discuss how to use these concepts to formulate constraint-based data mining tasks and design generic data mining algorithms. We finally discuss how these components would fit in the overall framework and in particular into a language for data mining and knowledge discovery.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proc. of the ACM SIGMOD Conf. on Management of Data, pp. 207–216. ACM Press, New York (1993)
Aho, A.V., Ullman, J.D., Hopcroft, J.E.: Data Structures and Algorithms. Addison-Wesley, Reading, MA (1983)
Allison, L.: Models for machine learning and data mining in functional programming. Journal of Functional Programming 15(1), 15–32 (2004)
R. Bayardo (ed.) Constraints in data mining. Special issue of SIGKDD Explorations, 4(1) (2002)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
Bistarelli, S., Bonch, F.: Interestingness is not a Dichotomy: Introducing Softness in Constrained Pattern Mining. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, Springer, Heidelberg (2005)
Blockeel, H., De Raedt, L., Ramon, J.: Top-down induction of clustering trees. In: Proc. of the 15th Intl. Conf. on Machine Learning, pp. 55–63. Morgan Kaufmann, San Mateo, CA (1998)
Boulicaut, J.-F., Jeudy, B.: Constraint-based data mining. In: Maimon, O., Rokach, L. (eds.) The Data Mining and Knowledge Discovery Handbook, pp. 399–416. Springer, Berlin (2005)
Boulicaut, J.-F., Masson, C.: Data mining query languages. In: Maimon, O., Rokach, L. (eds.) The Data Mining and Knowledge Discovery Handbook, Springer, Berlin (2005)
Boulicaut, J.-F., Klemettinen, M., Mannila, H.: Modeling KDD processes within the inductive database framework. In: Mohania, M.K., Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp. 293–302. Springer, Heidelberg (1999)
Boulicaut, J.-F., De Raedt, L., Mannila, H. (eds.): Constraint-Based Mining and Inductive Databases. LNCS (LNAI), vol. 3848. Springer, Heidelberg (2006)
Bracewell, R.N.: The Fourier Transform and Its Applications. McGraw-Hill, New York (1965)
Calders, T., Rigotti, C., Boulicaut, J.-F.: A survey on condensed representations for frequent sets. In: Boulicaut, J-F., De Raedt, L., Mannila, H. (eds.) Constraint-Based Mining and Inductive Databases. LNCS (LNAI), vol. 3848, pp. 64–80. Springer, Heidelberg (2006)
Calders, T., Goethals, B., Prado, A.B.: Integrating pattern mining in relational databases. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 454–461. Springer, Heidelberg (2006a)
Calders, T., Lakshmanan, L.V.S., Ng, R.T., Paredaens, J.: Expressive power of an algebra for data mining. ACM Transactions on Database Systems 31(4), 1169–1214 (2006b)
Cheng, H., Yan, X., Han, J., Hsu, C.-W.: Discriminative frequent pattern analysis for effective classification. In: Proc. 23nd Intl. Conf. on Data Engineering, pp. 716–725. IEEE Computer Society Press, Los Alamitos (2007)
Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms. Wiley & Sons, New York (2001)
De Raedt, L., Dehaspe, L.: Clausal discovery. Machine Learning 26, 99–146 (1997)
Dehaspe, L., Toivonen, H.: Discovery of frequent Datalog patterns. Data Mining and Knowledge Discovery 3(1), 7–36 (1999)
De Raedt, L.: A perspective on inductive databases. SIGKDD Explorations 4(2), 69–77 (2002a)
De Raedt, L.: Data mining as constraint logic programming. In: Kakas, A.C., Sadri, F. (eds.) Computational Logic: Logic Programming and Beyond. LNCS (LNAI), vol. 2408, pp. 113–125. Springer, Heidelberg (2002b)
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)
Džeroski, S.: Inductive logic programming in a nutshell. In: Getoor, L., Taskar, B. (eds.) Statistical Relational Learning, MIT Press, Cambridge, MA (2007)
Džeroski, S., Lavrač, N. (eds.): Relational Data Mining. Springer, Berlin (2001)
Džeroski, S., Todorovski, L., Ljubič, P.: Inductive queries on polynomial equations. In: Boulicaut, J-F., De Raedt, L., Mannila, H. (eds.) Constraint-Based Mining and Inductive Databases. LNCS (LNAI), vol. 3848, pp. 127–154. Springer, Heidelberg (2006)
Fayyad, U., Piatetsky-Shapiro, G., Uthurusamy, R.: Summary from the KDD-2003 panel – “Data Mining: The Next 10 Years”. SIGKDD Explorations 5(2), 191–196 (2003)
Friedman, J.H., Fisher, N.I.: Bump hunting in high-dimensional data. Statistics and Computing 9(2), 123–143 (1999)
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: An overview. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 495–515. MIT Press, Cambridge, MA (1996)
Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.J.: Knowledge discovery in databases: An overview. In: Knowledge Discovery in Databases, pp. 1–30. AAAI/MIT Press, Cambridge
Gaertner, T.: A survey of kernels for structured data. SIGKDD Explorations 5(1), 49–58 (2003)
Garofalakis, M., Hyun, D., Rastogi, R., Shim, K.: Building decision trees with constraints. Data Mining and Knowledge Discovery 7(2), 187–214 (2003)
Getoor, L., Taskar, B. (eds.): Statistical Relational Learning. MIT Press, Cambridge, MA (2007)
Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. In: Proc. of the 21st Intl. Conf. on Data Engineering, pp. 341–352. IEEE Computer Society Press, Los Alamitos (2005)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA (2001)
Hand, D.J., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press, Cambridge, MA (2001)
Haussler, D.: Convolution kernels on discrete structures. UC Santa Cruz, Technical Report UCS-CRL-99-10 (1999)
Imielinski, T., Mannila, H.: A database perspective on knowledge discovery. Communications of the ACM 39(11), 58–64 (1996)
Johnson, T., Lakshmanan, L.V., Ng, R.: The 3W model and algebra for unified data mining. In: Proc. of the Intl. Conf. on Very Large Data Bases, pp. 21–32. Morgan Kaufmann, San Francisco, CA (2000)
Kalousis, A., Woznica, A., Hilario, M.: A unifying framework for relational distance-based learning founded on relational algebra. Technical Report, Computer Science Department, University of Geneva (2006)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley & Sons, New York (1990)
King, R.D., Karwath, A., Clare, A., Dehaspe, L.: The utility of different representations of protein sequence for predicting functional class. Bioinformatics 17(5), 445–454 (2001)
Kloesgen, W.: Data mining tasks and methods: Subgroup discovery: deviation analysis. In: Kloesgen, W., Zytkow, J.M. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 354–361. Oxford University Press, Oxford (2002)
Kramer, S., Aufschild, V., Hapfelmeier, A., Jarasch, A., Kessler, K., Reckow, S., Wicker, J., Richter, L.: Inductive Databases in the Relational Model: The Data as the Bridge. In: Bonchi, F., Boulicaut, J.-F. (eds.) KDID 2005. LNCS, vol. 3933, pp. 124–138. Springer, Heidelberg (2006)
Lavrač, N., Kavšek, B., Flach, P.A., Todorovski, L.: Subgroup Discovery with CN2-SD. Journal of Machine Learning Research 5, 153–188 (2004)
Lavrač, N., Džeroski, S.: Inductive Logic Programming: Techniques and Applications. Ellis Horwood, Chichester (1994)
Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer, Dorderecht (1998)
Lloyd, J.W.: Foundations of Logic Programming. Springer, Berlin (1987)
Lloyd, J.W.: An introduction to deductive database systems. Australian Computer Journal 15(2), 52–57 (1983)
Lloyd, J.W.: Logic for Learning. Springer, Berlin (2003)
Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1999)
Inductive databases vision: Relational operations on models. Unpublished slides. In: Presented at the meeting of the cInQ project (December 2001)
Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1(3), 241–258 (1997)
Michalski, R.S.: Knowledge acquisition through conceptual clustering: A theoretical framework and an algorithm for partitioning data into conjunctive concepts. Intl. Jrnl. of Policy Analysis and Information Systems 4, 219–244 (1980)
Mitchell, T.M.: Generalization as search. Artif. Intell. 18(2), 203–226 (1982)
Nijssen, S., Fromont, E.: Mining optimal decision trees from itemset lattices. In: Proc. of The 13th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, ACM Press, New York (to appear, 2007)
Piatetsky-Shapiro, G., Djeraba, C., Getoor, L., Grossman, R., Feldman, R., Zaki, M.: What are the grand challenges for data mining? KDD-2006 Panel report. SIGKDD Explorations 8(2), 70–77 (2006)
Ramakrishnan, R., et al.: Data Mining: The Next Generation. In: Ramakrishnan, R., Agrawal, R., Freytag, J.-C. (eds.) Perspectives Wshp. – Data Mining: The Next Generation. Intl. Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany (2005)
Ramon, J., Bruynooghe, M.: A polynomial time computable metric between point sets. Acta Informatica 37(10), 765–780 (2001)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Siebes, A.: Data mining in inductive databases. In: Bonchi, F., Boulicaut, J-F. (eds.) KDID 2005. LNCS, vol. 3933, pp. 1–23. Springer, Heidelberg (2006)
Srinivasan, A., King, R.D.: Feature construction with inductive logic programming: A study of quantitative predictions of biological activity aided by structural attributes. Knowledge Discovery and Data Mining 3(1), 37–57 (1999)
Struyf, J., Džeroski, S.: Constraint based induction of multi-objective regression trees. In: Bonchi, F., Boulicaut, J-F. (eds.) KDID 2005. LNCS, vol. 3933, pp. 222–233. Springer, Heidelberg (2006)
Termier, A., Tamada, Y., Imoto, S., Washio, T., Higuchi, T.: From closed tree mining towards closed DAG mining. In: Proc. of the Intl. Wshp. on Data Mining and Statistical Science, pp. 1–7 (2006)
Thompson, S.: Haskell: The Craft of Functional Programming. Add. Wesley, Reading (1999)
Tušar, T.: Design of an Algorithm for Multiobjective Optimization with Differential Evolution. M.Sc. Thesis. Faculty of Computer and Information Science, University of Ljubljana, Slovenia (2007)
Vilalta, R., Drissi, Y.: A perspective view and survey of meta-learning. Artificial Intelligence Review 18(2), 77–95 (2002)
Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: Proc. 17th Intl. Conf. on Machine Learning, pp. 1103–1110. Morgan Kaufmann, San Francisco, CA (2000)
Woznica, A., Kalousis, A., Hilario, M.: Kernels on lists and sets over relational algebra: an application to classification of protein fingerprints. In: Ng, W-K., Kitsuregawa, M., Li, J., Chang, K. (eds.) PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 546–551. Springer, Heidelberg (2006)
Yang, Q., Wu, X.: 10 Challenging problems in data mining research. Intl. Jrnl. of Information Technology & Decision Making 5(4), 597–604 (2006)
Ženko, B., Džeroski, S., Struyf, J.: Learning predictive clustering rules. In: Bonchi, F., Boulicaut, J-F. (eds.) KDID 2005. LNCS, vol. 3933, pp. 234–250. Springer, Heidelberg (2006)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Džeroski, S. (2007). Towards a General Framework for Data Mining. In: Džeroski, S., Struyf, J. (eds) Knowledge Discovery in Inductive Databases. KDID 2006. Lecture Notes in Computer Science, vol 4747. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75549-4_16
Download citation
DOI: https://doi.org/10.1007/978-3-540-75549-4_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75548-7
Online ISBN: 978-3-540-75549-4
eBook Packages: Computer ScienceComputer Science (R0)