
An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets

  • Published in: The Journal of Supercomputing

Abstract

This paper describes a tiling technique that application programmers and optimizing compilers can use to obtain I/O-efficient versions of regular scientific loop nests. Because of the particular characteristics of I/O operations, a straightforward extension of the traditional tiling method to I/O-intensive programs may result in poor I/O performance. The technique presented in this paper therefore adapts iteration space tiling for I/O-performing loop nests to deliver high I/O performance. The generated code yields substantial savings in both the number of I/O calls and the volume of data transferred between the disk subsystem and main memory. Our experimental results on the IBM SP-2 distributed-memory message-passing multiprocessor demonstrate that reducing these two parameters, the number of I/O calls and the transferred data volume, can lead to a marked decrease in the overall execution times of I/O-intensive loop nests. In a number of loop nests extracted from several benchmarks and math libraries, we improved execution times by an average of 42.5% for one data set and by an average of 47.4% for another.
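To illustrate the idea behind I/O-conscious tiling (this is a minimal hypothetical sketch, not the paper's algorithm or code): instead of issuing one I/O call per array element touched by the loop nest, the loop is restructured so that each iteration of the outer loop reads one contiguous tile of the disk-resident array into memory, processes it there, and writes it back, shrinking both the number of I/O calls and the data traffic per call's overhead. The matrix size `N`, tile height `T`, and the element-wise scaling computation are all invented for the example.

```python
import os
import struct
import tempfile

# Hedged sketch: scale every entry of a disk-resident N x N matrix of
# doubles (row-major). The naive loop does one read and one write call
# per element; the tiled loop moves T whole rows per I/O call.
N, T, FACTOR = 8, 4, 2.0

def make_matrix_file(n):
    """Write an n x n row-major matrix of doubles (values 0..n*n-1) to disk."""
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(struct.pack(f"{n * n}d", *range(n * n)))
    f.close()
    return f.name

def naive_scale(path, n, factor):
    """Element-wise I/O: 2 * n * n calls in total."""
    calls = 0
    with open(path, "r+b") as f:
        for i in range(n):
            for j in range(n):
                off = (i * n + j) * 8
                f.seek(off)
                (x,) = struct.unpack("d", f.read(8)); calls += 1
                f.seek(off)
                f.write(struct.pack("d", x * factor)); calls += 1
    return calls

def tiled_scale(path, n, t, factor):
    """Tiled I/O: one read and one write per tile of t contiguous rows,
    i.e. 2 * ceil(n / t) calls in total."""
    calls = 0
    with open(path, "r+b") as f:
        for ii in range(0, n, t):
            rows = min(t, n - ii)
            off = ii * n * 8
            f.seek(off)
            tile = list(struct.unpack(f"{rows * n}d",
                                      f.read(rows * n * 8))); calls += 1
            tile = [x * factor for x in tile]   # in-memory computation
            f.seek(off)
            f.write(struct.pack(f"{rows * n}d", *tile)); calls += 1
    return calls

p1, p2 = make_matrix_file(N), make_matrix_file(N)
naive_calls = naive_scale(p1, N, FACTOR)
tiled_calls = tiled_scale(p2, N, T, FACTOR)
with open(p1, "rb") as a, open(p2, "rb") as b:
    assert a.read() == b.read()   # both versions compute the same result
os.unlink(p1)
os.unlink(p2)
print(naive_calls, tiled_calls)   # 128 vs 4 I/O calls
```

The tile height here is bounded by available memory, which mirrors the central trade-off the paper studies: larger tiles mean fewer, bigger I/O calls, up to the point where the tile no longer fits in main memory.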




Cite this article

Kandemir, M., Choudhary, A. & Ramanujam, J. An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets. The Journal of Supercomputing 21, 257–284 (2002). https://doi.org/10.1023/A:1014156327748
