Abstract
Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve “out-of-core” problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully automatic technique that liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer; the operating system supports nonbinding prefetch and release hints for managing I/O; and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively insert prefetches ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We implemented our compiler analysis within the SUIF compiler, and used it to target implementations of our run-time and OS support on both research and commercial systems (Hurricane and IRIX 6.5, respectively). Our experimental results show large performance gains for out-of-core scientific applications on both systems: more than 50% of the I/O stall time is eliminated in most cases, which translates into overall speedups of roughly twofold in many cases.
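To make the idea of nonbinding prefetch and release hints concrete, the following is a minimal sketch in C of the kind of loop a compiler could produce. It is not the interface described in the paper: it assumes the POSIX call posix_madvise as a stand-in for the prefetch and release hints, and the function name sum_with_hints and the constant PREFETCH_DISTANCE_PAGES are hypothetical, chosen only for illustration. Because the hints are advisory, their return values are ignored and the result is the same whether or not the operating system acts on them.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed prefetch distance in pages; a real system would tune this
 * to the disk latency and the cost of one loop iteration. */
enum { PREFETCH_DISTANCE_PAGES = 16 };

/* Hypothetical example: sum n doubles, one page-sized strip at a time,
 * hinting pages needed soon and pages no longer needed.
 * The array is assumed to start on a page boundary (e.g. from mmap);
 * if it does not, the OS may reject the hints, which is harmless. */
double sum_with_hints(double *a, size_t n)
{
    long page_l = sysconf(_SC_PAGESIZE);
    assert(page_l > 0);
    size_t page = (size_t)page_l;
    size_t per_page = page / sizeof(double);
    char *base = (char *)a;
    size_t bytes = n * sizeof(double);
    double sum = 0.0;

    /* Prologue: hint the first PREFETCH_DISTANCE_PAGES pages at once. */
    size_t first = (size_t)PREFETCH_DISTANCE_PAGES * page;
    if (first > bytes)
        first = bytes;
    if (first > 0)
        (void)posix_madvise(base, first, POSIX_MADV_WILLNEED);

    for (size_t i = 0; i < n; i += per_page) {
        size_t off = i * sizeof(double);

        /* Prefetch hint for the page PREFETCH_DISTANCE_PAGES ahead. */
        size_t ahead = off + (size_t)PREFETCH_DISTANCE_PAGES * page;
        if (ahead < bytes) {
            size_t len = bytes - ahead < page ? bytes - ahead : page;
            (void)posix_madvise(base + ahead, len, POSIX_MADV_WILLNEED);
        }

        /* Release hint for the page just finished. */
        if (off >= page)
            (void)posix_madvise(base + off - page, page, POSIX_MADV_DONTNEED);

        /* The actual computation on the current page-sized strip. */
        size_t end = i + per_page < n ? i + per_page : n;
        for (size_t j = i; j < end; j++)
            sum += a[j];
    }
    return sum;
}
```

The program still reads the array through ordinary memory references, so the virtual memory abstraction is untouched; the hints only give the operating system advance notice. How an operating system treats each advice value varies by platform, so this sketch should be read as an illustration of the loop structure rather than of any particular system's behavior.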
Index Terms
- Compiler-based I/O prefetching for out-of-core applications