Abstract
Recent years have seen explosive growth in the availability of various kinds of data, creating an unprecedented opportunity to develop automated, data-driven techniques for extracting useful knowledge. Data mining, an important step in this process of knowledge discovery, consists of methods that discover interesting, non-trivial, and useful patterns hidden in the data [SAD+93, CHY96]. The field builds upon ideas from diverse disciplines such as machine learning, pattern recognition, statistics, database systems, and data visualization. However, techniques developed in these traditional disciplines are often unsuitable because of unique characteristics of today’s data sets, such as their enormous size, high dimensionality, and heterogeneity. Effective parallel algorithms are therefore needed for various data mining techniques. Designing such algorithms is challenging, and the main focus of this paper is a description of parallel formulations of two important data mining algorithms: discovery of association rules and induction of decision trees for classification. We also briefly discuss an application of data mining to the analysis of large data sets collected by Earth-observing satellites, which must be processed to better understand global-scale changes in biosphere processes and patterns.
This work was supported by NSF CCR-9972519, by NASA grant # NCC 2 1231, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE grant LLNL/DOE B347714, and by Army High Performance Computing Research Center cooperative agreement number DAAD19-01-2-0014. Access to computing facilities was provided by AHPCRC and the Minnesota Supercomputer Institute. Related papers are available via WWW at URL: http://www.cs.umn.edu/~Rkumar.
References
R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Eng., 5(6):914–925, December 1993.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of 1993 ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
R. Agrawal and J.C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Eng., 8(6):962–969, December 1996.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487–499, Santiago, Chile, 1994.
J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H.W. To, and D. Yang. Large scale data mining: Challenges and responses. In Proc. of the Third Int’l Conference on Knowledge Discovery and Data Mining, 1997.
M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Eng., 8(6):866–883, December 1996.
D. Michie, D. J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
S. Goil, S. Aluru, and S. Ranka. Concatenated parallelism: A technique for efficient parallel divide and conquer. In Proc. of the Symposium of Parallel and Distributed Computing (SPDP’96), 1996.
D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, 1997.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Eng., 12(3), May/June 2000.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
M.V. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M. J. Zaki and C.-T. Ho, editors, Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence (LNCS/LNAI), volume 1759. Springer-Verlag, 2000.
M.V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proc. of the International Parallel Processing Symposium, 1998.
M.V. Joshi, G. Karypis, and V. Kumar. Universal formulation of sequential patterns. Technical Report TR 99-021, Department of Computer Science, University of Minnesota, Minneapolis, 1999.
R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence 3. Elsevier Science, 1997.
R. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 4(22), April 1987.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int’l Conference on Extending Database Technology, Avignon, France, 1996.
R. A. Pearson. A coarse grained parallel induction heuristic. In H. Kitano, V. Kumar, and C.B. Suttner, editors, Parallel Processing for Artificial Intelligence 2, pages 207–226. Elsevier Science, 1994.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd VLDB Conference, 1996.
A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237–261, September 1999.
M. Steinbach, P. Tan, V. Kumar, S. Klooster, and C. Potter. Temporal data mining for the discovery and analysis of ocean climate indices. In KDD Workshop on Temporal Data Mining (KDD’2002), Edmonton, Alberta, Canada, 2001.
M. Stonebraker, R. Agrawal, U. Dayal, E. J. Neuhold, and A. Reuter. DBMS research at a crossroads: The Vienna update. In Proc. of the 19th VLDB Conference, pages 688–692, Dublin, Ireland, 1993.
P. Tan, M. Steinbach, V. Kumar, S. Klooster, C. Potter, and A. Torregrosa. Finding spatio-temporal patterns in earth science data. In KDD Workshop on Temporal Data Mining (KDD’2001), San Francisco, California, 2001.
M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency (Special Issue on Data Mining), December 1999.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kumar, V., Joshi, M.V., Han, EH., Tan, PN., Steinbach, M. (2003). High Performance Data Mining. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds) High Performance Computing for Computational Science — VECPAR 2002. Lecture Notes in Computer Science, vol 2565. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36569-9_8
DOI: https://doi.org/10.1007/3-540-36569-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00852-1
Online ISBN: 978-3-540-36569-3
eBook Packages: Springer Book Archive