Compiler-based I/O prefetching for out-of-core applications

Authors:
Angela Demke Brown, Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
Todd C. Mowry, Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
Orran Krieger, IBM T. J. Watson Research Center, Yorktown Heights, NY

ACM Transactions on Computer Systems, Volume 19, Issue 2, pp 111–170, https://doi.org/10.1145/377769.377774

Published: 01 May 2001

Abstract

Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve “out-of-core” problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme the compiler provides the crucial information on future access patterns without burdening the programmer; the operating system supports nonbinding prefetch and release hints for managing I/O; and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively insert prefetches ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We implemented our compiler analysis within the SUIF compiler, and used it to target implementations of our run-time and OS support on both research and commercial systems (Hurricane and IRIX 6.5, respectively). Our experimental results show large performance gains for out-of-core scientific applications on both systems: more than 50% of the I/O stall time has been eliminated in most cases, thus translating into overall speedups of roughly twofold in many cases.
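To make the idea of compiler-inserted prefetch and release hints concrete, the following is a minimal sketch of what an instrumented sequential sweep over an out-of-core array might look like. It is an illustration only, not the paper's algorithm or the SUIF pass: the function names (`sweep_with_hints`, `trace_sweep`), the page-granularity loop, and the fixed prefetch `distance` are all assumptions introduced here, and the hints are simulated by logging calls rather than by talking to a real operating system.

```python
from typing import Callable, List, Tuple


def sweep_with_hints(n_pages: int, distance: int,
                     prefetch: Callable[[int], None],
                     release: Callable[[int], None],
                     work: Callable[[int], None]) -> None:
    """Visit pages 0..n_pages-1 in order, issuing hints around the work.

    `prefetch` and `release` stand in for nonbinding hints (hypothetical
    callbacks); `work` stands in for the original loop body on one page.
    """
    # Prolog: request the first `distance` pages before computing anything.
    for p in range(min(distance, n_pages)):
        prefetch(p)

    # Steady state: ask for the page `distance` ahead, compute on the
    # current page, then hint that the previous page is no longer needed.
    for p in range(n_pages):
        if p + distance < n_pages:
            prefetch(p + distance)
        work(p)
        if p > 0:
            release(p - 1)

    # Epilog: release the final page once the sweep is complete.
    if n_pages > 0:
        release(n_pages - 1)


def trace_sweep(n_pages: int, distance: int) -> List[Tuple[str, int]]:
    """Run the sweep with logging callbacks and return the event order."""
    log: List[Tuple[str, int]] = []
    sweep_with_hints(
        n_pages, distance,
        prefetch=lambda p: log.append(("prefetch", p)),
        release=lambda p: log.append(("release", p)),
        work=lambda p: log.append(("work", p)),
    )
    return log
```

Calling `trace_sweep(4, 2)` yields an event log in which every page is requested before it is used and released only afterwards. Because the hints are nonbinding, dropping any of the `prefetch` or `release` calls would change only the timing of a real run, never its result.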

Index Terms

Compiler-based I/O prefetching for out-of-core applications
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Virtual memory
    2. Extra-functional properties
      1. Software performance

Reviews

Reviewer: Ted Brown

Applications that need very large arrays, and that access the elements of those arrays in mostly sequential order, can improve their run times by prefetching out-of-core pages into memory. These applications are often scientific numerical applications. This paper clearly lays out an automated aid built on top of a virtual paged memory. The problem is complex: prefetching pages too early reduces the size of effective memory, while prefetching pages too late slows down the processing.

It might be argued that the application programmer is in the best seat to write these commands. The authors make the case that it is not only onerous for a programmer to be responsible for adding prefetching instructions, but also that the size of main memory, the speed of I/O devices, and so on cannot be, and should not be, the programmer's concern. Just as important, hand-written prefetching makes the code less portable, as changes to the hardware can affect the efficiency of the prefetching. The authors' solution is to automate the insertion of prefetch commands into the application code and have the application program interface with the operating system at run time for the final decision about whether to do the prefetching or not. Consequently, the authors needed to modify the compiler, the I/O part of the operating system, and the operating system's memory manager component.

The paper, at almost 60 pages, is exceptionally long for a journal. The reason, I quickly saw, is that this is a must-read paper for anyone doing work in the area. It is clearly written and has a number of nicely thought-out practices. For example, the compiler provides a guess of the future access patterns of data and inserts prefetch and release operations. But these are nonbinding performance hints; at run time it is up to the operating system layer to make what it thinks is the most effective decision at the time each hint occurs. It must decide whether prefetching a page could remove a page that may still be needed, and that may even be needed before the requested page.

In the authors' system the application works closely with the operating system to make prefetching decisions. As they point out, the application itself should drive these decisions because it knows (or should know) its own behavior best, whereas the operating system knows memory usage. The paper is written in a layered format, with the level of detail increased three times. First an outline is given, then a justification for the augmentation to a system and an overview of the components. Finally, in the longest sections, comes a detailed description of the components of the system. Each is well written. The authors evaluate their ideas on two operating systems using the NAS Parallel benchmarks (nine applications) and find a large speedup of roughly twofold in many cases.
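The division of labor described in this review, where hints are nonbinding and a layer between the application and the operating system decides which of them to act on, can be sketched as a small filter. This is a hedged illustration under stated assumptions, not the authors' run-time layer: the class name `HintFilter`, the per-page residency list, and the optimistic bookkeeping are hypothetical choices made here to show one plausible way of avoiding the cost of redundant prefetch requests.

```python
from typing import Callable, List


class HintFilter:
    """User-level filter that drops prefetch hints for pages believed resident.

    `issue_prefetch` is a hypothetical callback standing in for the
    (comparatively expensive) request to the operating system.
    """

    def __init__(self, n_pages: int, issue_prefetch: Callable[[int], None]):
        # Assumption: one flag per page, True when the page is believed
        # to be in memory or already requested.
        self.resident: List[bool] = [False] * n_pages
        self.issue_prefetch = issue_prefetch
        self.filtered = 0  # count of hints dropped without reaching the OS

    def prefetch(self, page: int) -> bool:
        """Forward the hint only if the page is not believed resident."""
        if self.resident[page]:
            self.filtered += 1
            return False
        self.issue_prefetch(page)
        self.resident[page] = True  # optimistic: assume the request succeeds
        return True

    def release(self, page: int) -> None:
        """Record that the page was released and may need fetching again."""
        self.resident[page] = False
```

In this sketch a second `prefetch` of the same page is absorbed by the filter and only counted, while a `prefetch` after a `release` is forwarded again. A real system would also have to learn about pages the operating system evicts on its own, which this simplified version does not model.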

Published in

ACM Transactions on Computer Systems Volume 19, Issue 2
May 2001
171 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/377769

Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
compiler optimization
prefetching
virtual memory