Compiler and Middleware Support for Scalable Data Mining

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2001)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2624))

Abstract

The parallelizing compiler community has traditionally focused its efforts on scientific applications. This paper gives an overview of a compiler/runtime project targeting parallel and scalable execution of data mining algorithms. To the best of our knowledge, this is the first project with such a focus.

Data mining is the process of analyzing large datasets to extract novel and useful patterns or models. Although considerable effort has gone into developing parallel algorithms for data mining tasks, the expertise and effort currently required to implement, maintain, and performance-tune a parallel data mining application remain an impediment to the wide use of parallel computers for data mining.

We have developed a data parallel dialect of Java that can be used for expressing common data mining algorithms at a high level. Our compiler generates a middleware specification from this dialect of Java. The middleware supports both distributed memory and shared memory parallelization, and performs a number of I/O optimizations to support efficient processing of disk-resident datasets. Our final goal is to start from declarative mining operators and translate them to data parallel Java.

In this paper, we describe the commonality among different data mining algorithms, the middleware and its interface, the data parallel dialect of Java, and the compilation techniques required for generating the middleware specification. Experimental evaluations of the middleware and the compiler are also presented.
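The abstract does not show the dialect's syntax or the middleware interface, so the following is only an illustrative sketch in plain Java, not the authors' dialect or API. It assumes that the commonality in question resembles a reduction: each worker processes its own chunk of the data and updates a private result, and the partial results are then merged. The class name `ReductionSketch` and the methods `localReduce`, `globalCombine`, and `countItems` are hypothetical, and item counting (as used in association mining) is chosen purely as an example workload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a chunked, two-phase reduction over a transaction dataset.
public class ReductionSketch {

    // Local phase: count item occurrences within one chunk of transactions.
    static int[] localReduce(int[][] chunk, int numItems) {
        int[] counts = new int[numItems];
        for (int[] transaction : chunk) {
            for (int item : transaction) {
                counts[item]++;
            }
        }
        return counts;
    }

    // Global phase: merge one partial result into the running total.
    static void globalCombine(int[] total, int[] partial) {
        for (int i = 0; i < total.length; i++) {
            total[i] += partial[i];
        }
    }

    // Splits the dataset into at most numWorkers chunks, reduces each chunk
    // on a thread pool, then combines the partial counts sequentially.
    static int[] countItems(int[][] transactions, int numItems, int numWorkers)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Future<int[]>> partials = new ArrayList<>();
            int chunkSize = (transactions.length + numWorkers - 1) / numWorkers;
            for (int start = 0; start < transactions.length; start += chunkSize) {
                int end = Math.min(start + chunkSize, transactions.length);
                int[][] chunk = Arrays.copyOfRange(transactions, start, end);
                partials.add(pool.submit(() -> localReduce(chunk, numItems)));
            }
            int[] total = new int[numItems];
            for (Future<int[]> partial : partials) {
                globalCombine(total, partial.get());
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[][] transactions = { {0, 1}, {1, 2}, {0, 1, 2}, {1} };
        int[] counts = countItems(transactions, 3, 2);
        System.out.println(Arrays.toString(counts)); // prints [2, 4, 2]
    }
}
```

Because each worker writes only to its own array, the local phase needs no locking; the same split into a local and a global step would in principle carry over to a distributed memory setting, where the combine step becomes a communication step.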

This work was supported by NSF grant ACR-9982097, NSF CAREER award ACI-9733520, and NSF grant CCR-980852.

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Agrawal, G., Jin, R., Li, X. (2003). Compiler and Middleware Support for Scalable Data Mining. In: Dietz, H.G. (ed.) Languages and Compilers for Parallel Computing. LCPC 2001. Lecture Notes in Computer Science, vol 2624. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-35767-X_3

  • DOI: https://doi.org/10.1007/3-540-35767-X_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-04029-3

  • Online ISBN: 978-3-540-35767-4

  • eBook Packages: Springer Book Archive
