Compiler and Middleware Support for Scalable Data Mining

Agrawal, Gagan; Jin, Ruoming; Li, Xiaogang

doi:10.1007/3-540-35767-X_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2624)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

The parallelizing compiler community has traditionally focused its efforts on scientific applications. This paper gives an overview of a compiler/runtime project targeting parallel and scalable execution of data mining algorithms. To the best of our knowledge, this is the first project with such a focus.

Data mining is the process of analyzing large datasets to extract novel and useful patterns or models. Though a lot of effort has been put into developing parallel algorithms for data mining tasks, the expertise and effort currently required to implement, maintain, and performance-tune a parallel data mining application is an impediment to the wide use of parallel computers for data mining.

We have developed a data parallel dialect of Java that can be used for expressing common data mining algorithms at a high level. Our compiler generates a middleware specification from this dialect of Java. The middleware supports both distributed memory and shared memory parallelization, and performs a number of I/O optimizations to support efficient processing of disk resident datasets. Our final goal is to start from declarative mining operators, and translate them to data parallel Java.

In this paper, we describe the commonality among different data mining algorithms, the middleware and its interface, the data parallel dialect of Java, and the compilation techniques required for generating the middleware specification. Experimental evaluations of the middleware and the compiler are also presented.
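
The abstract does not spell out the shared structure it refers to, nor the syntax of the Java dialect. As a rough illustration only, the sketch below assumes the commonality is a local reduction over each data partition followed by a global combination of the per-partition results, which is one structure that suits both shared memory and distributed memory execution. It is written in plain Java, and every name in it (`ReductionSketch`, `localReduce`, `combine`, `mine`) is hypothetical rather than part of the authors' dialect or middleware interface.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration in plain Java; not the paper's dialect or middleware API.
class ReductionSketch {

    // Local reduction: one node (or thread) scans its own partition and
    // accumulates into a private reduction object (here, per-item counts).
    static int[] localReduce(int[][] transactions, int numItems) {
        int[] counts = new int[numItems];
        for (int[] transaction : transactions) {
            for (int item : transaction) {
                counts[item]++;
            }
        }
        return counts;
    }

    // Global combination: merge two reduction objects element-wise.
    // Returns a fresh array so it is safe to use in a parallel reduce.
    static int[] combine(int[] a, int[] b) {
        int[] merged = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = a[i] + b[i];
        }
        return merged;
    }

    // Process every partition independently, then combine the partial results.
    static int[] mine(List<int[][]> partitions, int numItems) {
        return partitions.parallelStream()
                .map(p -> localReduce(p, numItems))
                .reduce(new int[numItems], ReductionSketch::combine);
    }

    public static void main(String[] args) {
        int[][] p0 = {{0, 1}, {1, 2}}; // partition held by one node
        int[][] p1 = {{1}, {0, 2}};    // partition held by another node
        int[] counts = mine(Arrays.asList(p0, p1), 3);
        System.out.println(Arrays.toString(counts)); // prints [2, 3, 2]
    }
}
```

Because the combination step is associative and commutative, the partitions can be processed in any order, which is what lets a runtime choose how to schedule and merge them.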

This work was supported by NSF grant ACR-9982097, NSF CAREER award ACI-9733520, and NSF grant CCR-980852.

Author information

Authors and Affiliations

Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19716
Gagan Agrawal, Ruoming Jin & Xiaogang Li

Editor information

Editors and Affiliations

Electrical and Computer Engineering Department, University of Kentucky, Lexington, KY, 40506-0046, USA
Henry G. Dietz

About this paper

Cite this paper

Agrawal, G., Jin, R., Li, X. (2003). Compiler and Middleware Support for Scalable Data Mining. In: Dietz, H.G. (eds) Languages and Compilers for Parallel Computing. LCPC 2001. Lecture Notes in Computer Science, vol 2624. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-35767-X_3

DOI: https://doi.org/10.1007/3-540-35767-X_3
Published: 13 May 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04029-3
Online ISBN: 978-3-540-35767-4
eBook Packages: Springer Book Archive
