Abstract
Memory bandwidth is rapidly becoming the performance bottleneck in the application of high-performance microprocessors to vector-like algorithms, including the “Grand Challenge” scientific problems. Caching alone does not solve the problem for these applications, because their data accesses exhibit poor temporal and spatial locality. Moreover, the nature of memories themselves has changed. Achieving greater bandwidth requires exploiting the characteristics of memory components “on the other side of the cache”; they should not be treated as uniform access-time RAM. This paper describes the use of hardware-assisted access ordering, a technique that combines compile-time detection of memory access patterns with a memory subsystem that decouples the order of requests generated by the processor from the order in which they are issued to the memory system. This decoupling permits the requests to be issued in an order that optimizes use of the memory system. Our simulations show significant speedup on important scientific kernels.
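To make the idea of access ordering concrete, the following is a minimal sketch and not the simulator used in the paper. It assumes a single-bank page-mode DRAM in which an access to the currently open page is cheaper than an access that must open a new page. The page size, the cycle costs, the choice of a daxpy-style kernel (y[i] = a*x[i] + y[i]), the block depth, and all function names are hypothetical values chosen for illustration.

```python
PAGE_WORDS = 512              # assumed DRAM page size, in words
HIT_COST, MISS_COST = 1, 4    # assumed cycle costs for page hit / page miss


def memory_cycles(addresses):
    """Cycles a single-bank page-mode DRAM needs to serve addresses in order."""
    open_page = None
    cycles = 0
    for addr in addresses:
        page = addr // PAGE_WORDS
        cycles += HIT_COST if page == open_page else MISS_COST
        open_page = page
    return cycles


def natural_order(n, x_base, y_base):
    """Order in which the processor generates requests for the kernel:
    read x[i], read y[i], write y[i], for each i."""
    for i in range(n):
        yield x_base + i
        yield y_base + i
        yield y_base + i


def block_order(n, x_base, y_base, depth):
    """Same requests, regrouped: `depth` reads of x, then `depth` reads of y,
    then `depth` writes of y, so consecutive accesses stay in one page."""
    for start in range(0, n, depth):
        stop = min(start + depth, n)
        yield from range(x_base + start, x_base + stop)   # read x block
        yield from range(y_base + start, y_base + stop)   # read y block
        yield from range(y_base + start, y_base + stop)   # write y block


# x and y are placed in different DRAM pages (an assumption of this sketch).
n, x_base, y_base = 1024, 0, 4096
natural = memory_cycles(natural_order(n, x_base, y_base))
ordered = memory_cycles(block_order(n, x_base, y_base, depth=32))
print(f"natural order: {natural} cycles, block order: {ordered} cycles")
```

Both orders issue exactly the same set of requests; only the sequence presented to the memory differs. Under this simple cost model the regrouped order pays the page-miss cost once per block rather than on nearly every access, which is the kind of gain that decoupling processor order from memory order is meant to expose.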
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
McKee, S.A., Moyer, S.A., Wulf, W.A., Hitchcock, C. (1994). Increasing memory bandwidth for vector computations. In: Gutknecht, J. (eds) Programming Languages and System Architectures. Lecture Notes in Computer Science, vol 782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57840-4_26
DOI: https://doi.org/10.1007/3-540-57840-4_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57840-6
Online ISBN: 978-3-540-48356-4
eBook Packages: Springer Book Archive