Abstract
The parallelizing compiler community has traditionally focused its efforts on scientific applications. This paper gives an overview of a compiler/runtime project targeting parallel and scalable execution of data mining algorithms. To the best of our knowledge, this is the first project with such a focus.
Data mining is the process of analyzing large datasets for extracting novel and useful patterns or models. Though a lot of effort has been put into developing parallel algorithms for data mining tasks, the expertise and effort currently required in implementing, maintaining, and performance tuning a parallel data mining application is an impediment in the wide use of parallel computers for data mining.
We have developed a data parallel dialect of Java that can be used for expressing common data mining algorithms at a high level. Our compiler generates a middleware specification from this dialect of Java. The middleware supports both distributed memory and shared memory parallelization, and performs a number of I/O optimizations to support efficient processing of disk resident datasets. Our final goal is to start from declarative mining operators, and translate them to data parallel Java.
In this paper, we describe the commonality among different data mining algorithms, the middleware and its interface, the data parallel dialect of Java, and the compilation techniques required for generating the middleware specification. Experimental evaluations of the middleware and the compiler are also presented.
This work was supported by NSF grant ACR-9982097, NSF CAREER award ACI-9733520, and NSF grant CCR-980852.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
H. Agrawal, R. A. DeMillo, and E. H. Spafford. Dynamic slicing in the presence of unconstrained pointers. In Proceedings of the ACM Fourth Symposium on Testing, Analysis and Verification (TAV 4), pages 60–73, 1991.
R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, June 1996.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. conf. Very Large DataBases (VLDB’94), pages 487–499, Santiago, Chile, September 1994.
R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation issues in the design of i/o intensive data mining applications on clusters of workstations. In Proceedings of Workshop on High Performance Data Mining IPDPS 2000, LNCS Volume 1800, pages 350–357. Springer Verlag, 2000.
P. Becuzzi, M. Coppola, and M. Vanneschi. Mining of association rules in very large databases: A structured parallel approach. In Proceedings of Europar-99, Lecture Notes in Computer Science (LNCS) Volume 1685, pages 1441–1450. Springer Verlag, August 1999.
W. Blume, R. Doallo, R. Eigenman, J. Grout, J. Hoelflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, (12): 78–82, December 1996.
Francois Bodin, Peter Beckman, Dennis Gannon, Srinivas Narayana, and Shelby X. Yang. Distributed pC++: Basic ideas for an object parallel language. Scientific Programming, 2(3), Fall 1993.
R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, and M. Paleczny. A model and compilation strategy for out-of-core data parallel programs. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 1–10. ACM Press, July 1995. ACM SIGPLAN Notices, Vol. 30, No. 8.
Christan Borgelt. Apriori. http://fuzzy.cs.Uni-Magdeburg.de/borgelt/Software. Version 1.8.
C. Chang, A. Acharya, A. Sussman, and J. Saltz. T2: A customizable parallel database for multi-dimensional data. ACM SIGMOD Record, 27(1):58–66, March 1998.
Chialin Chang, Renato Ferreira, Alan Sussman, and Joel Saltz. Infrastructure for building parallel database systems for multi-dimensional data. In Proceedings of the Second Merged IPPS/SPDP (13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing). IEEE Computer Society Press, April 1999.
Chialin Chang, Tahsin Kurc, Alan Sussman, and Joel Saltz. Query planning for range queries with user-defined aggregation on multi-dimensional scientific datasets. Technical Report CS-TR-3996 and UMIACS-TR-99-15, University of Maryland, Department of Computer Science and UMIACS, February 1999.
A.A. Chien and W.J. Dally. Concurrent aggregates (CA). In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 187–196. ACM Press, March 1990.
John Darlington, Moustafa M. Ghanem, Yike Guo, and H. W. To. Performance models for co-ordinating parallel data classification. In Proceedings of the Seventh International Parallel Computing Workshop (PCW-97), Canberra, Australia, September 1997.
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In In Proceedings of Workshop on Large-Scale Parallel KDD Systems, in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), pages 47–56, August 1999.
M. W. Hall, S. Amarsinghe, B. R. Murphy, S. Liao, and Monica Lam. Detecting Course-Grain Parallelism using an Interprocedural Parallelizing Compiler. In Proceedings Supercomputing’ 95, December 1995.
E-H. Han, G. Karypis, and V. Kumar. Scalable parallel datamining for association rules. In Proceedings of ACM SIGMOD 1997, May 1997.
E-H. Han, G. Karypis, and V. Kumar. Scalable parallel datamining for association rules. IEEE Transactions on Data and Knowledge Engineering, 12(3), May / June 2000.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
High Performance Fortran Forum. Hpf language specification, version 2.0. Available from http://www.crpc.rice.edu/HPFF/versions/hpf2/.les/hpfv20.ps.gz, January 1997.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
M. Kandemir, A. Choudhary, and A. Choudhary. Compiler optimizations for i/o intensive computations. In Proceedings of International Conference on Parallel Processing, September 1999.
M. Kandemir, A. Choudhary, J. Ramanujam, and R. Bordawekar. Compilation techniques for out-of-core parallel computations. Parallel Computing, (3–4):597–628, June 1998.
M. Kandemir, A. Choudhary, J. Ramanujam, and M. A.. Kandaswamy. A unified framework for optimizing locality, parallelism, and comunication in out-of-core computations. IEEE Transactions on Parallel and Distributed Systems, 11(9):648–662, 2000.
M. Kandemir, J. Ramanujam, and A. Choudhary. Improving the performance of out-of-core computations. In Proceedings of International Conference on Parallel Processing, August 1997.
Bo Lu and John Mellor-Crummey. Compiler optimization of implicit reductions for distributed memory multiprocessors. In Proceedings of the 12th International Parallel Processing Symposium (IPPS), April 1998.
William A. Maniatty and Mohammed J. Zaki. A requirements analysis for parallel kdd systems. In Proceedings of Workshop on High Performance Data Mining, IPDPS 2000, LNCS Volume 1800, pages 358–365. IEEE Computer Society Press, May 2000.
Jose E. Moreira, Samuel P. Midkiff, Manish Gupta, and Richard D. Lawrence. Parallel data mining in Java. Technical Report RC 21326, IBM T. J. Watson Research Center, November 1998.
Todd C. Mowry, Angela K. Demke, and Orran Krieger. Automatic compiler-inserted i/o prefetching for out-of-core applications. In Proceedings of the Second Symposium on Operating Systems Design and plementation (OSDI’ 96), Nov 1996.
S. K. Murthy. Automatic construction of decision trees from data: A multidisciplinary survey. Data Mining and Knowledge Discovery, 2(4):345–389, 1998.
M. Paleczny, K. Kennedy, and C. Koelbel. Compiler support for out-of-core arrays on parallel machines. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 110–118. IEEE Computer Society Press, February 1995.
Joel Saltz, Kathleen Crowley, Ravi Mirchandaney, and Harry Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303–312, April 1990.
Joel H. Saltz, Ravi Mirchandaney, and Kay Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603–612, May 1991.
David B. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, Oct–Dec 1999.
D.B. Skillicorn. Strategies for parallelizing data mining. In Proceedings of the Workshop on High-Performance Data Mining, in association with IPPS/SPDP 1998, April 1998.
Kilian Stoffel and Abdelkader Belkoniene. Parallel k/h-means clustering for large datasets. In Proceedings of Europar-99, Lecture Notes in Computer Science (LNCS) Volume 1685, pages 1451–1454. Spring Verlag, August 1999.
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kutipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70–78, June 1996.
Rajeev Thakur, Rajesh Bordawekar, and Alok Choudhary. Compilation of out-of-core data parallel programs for distributed memory machines. In Proceedings of the IPPS’94 Second Annual Workshop on Input/Output in Parallel Computer Systems, pages 54–72, April 1994. Also appears in ACM Computer Architecture News, Vol. 22, No. 4, September 1994.
F. Tip. A survey of program slicing techniques. Journal of Programming Languages, 3(3):121–189, September 1995.
Janet Wu, Raja Das, Joel Saltz, Harry Berryman, and Seema Hiranandani. Distributed emory compiler design for sparse problems. IEEE Transactions on Computers, 44(6):737–753, June 1995.
K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Libit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. Concurrency Practice and Experience, 9(11), November 1998.
M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared memory multiprocessors. In Proceedings of Supercomputing’96, November 1996.
Mohammed J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4): 14–25, 1999.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Agrawal, G., Jin, R., Li, X. (2003). Compiler and Middleware Support for Scalable Data Mining. In: Dietz, H.G. (eds) Languages and Compilers for Parallel Computing. LCPC 2001. Lecture Notes in Computer Science, vol 2624. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-35767-X_3
Download citation
DOI: https://doi.org/10.1007/3-540-35767-X_3
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04029-3
Online ISBN: 978-3-540-35767-4
eBook Packages: Springer Book Archive