A Simulation Study of Decoupled Vector Architectures

Espasa, Roger; Valero, Mateo

doi:10.1023/A:1008158808410

Published: September 1999

Volume 14, pages 124–152, (1999)

The Journal of Supercomputing

Roger Espasa¹ &
Mateo Valero¹

Abstract

Decoupling techniques can be applied to a vector processor, resulting in a large increase in performance of vectorizable programs. We simulate a selection of programs from the Perfect Club and SPECfp92 benchmark suites and compare their execution time on a conventional single-port vector architecture with that on a decoupled vector architecture. Decoupling improves performance by a factor greater than 1.4 for realistic memory latencies, and even for an ideal memory system with zero latency there is still a speedup of as much as 1.3. A significant portion of this paper is devoted to studying the tradeoffs involved in choosing a suitable size for the queues of the decoupled architecture. The hardware cost of the queues need not be large to achieve most of the performance advantages of decoupling.
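
The role of queue size can be illustrated with a toy timing model. The sketch below is not the authors' simulator and does not reproduce their figures. It assumes a hypothetical loop of identical iterations, each consisting of one vector load followed by one vector computation, and it models the conventional machine pessimistically as performing no overlap at all. The function names and the parameters `latency`, `transfer`, `compute`, and `queue_slots` are illustrative assumptions.

```python
def coupled_cycles(n, latency, transfer, compute):
    """Assumed baseline: each load completes and is consumed before the next starts."""
    return n * (latency + transfer + compute)


def decoupled_cycles(n, latency, transfer, compute, queue_slots):
    """Toy decoupled model with a bounded load-data queue.

    The access side issues loads back to back on a single memory port,
    but only while a queue slot is free. The execute side consumes the
    loaded vectors in order and frees a slot when it starts computing.
    """
    issue = [0] * n   # cycle at which load i is issued
    start = [0] * n   # cycle at which computation i starts
    finish_prev = 0   # cycle at which computation i-1 finishes
    for i in range(n):
        # The single port is busy for `transfer` cycles per load.
        t = issue[i - 1] + transfer if i > 0 else 0
        # A slot is free only once the entry issued queue_slots loads
        # earlier has been popped by the execute side.
        if i >= queue_slots:
            t = max(t, start[i - queue_slots])
        issue[i] = t
        ready = t + latency + transfer        # all elements have arrived
        start[i] = max(ready, finish_prev)    # in-order execution
        finish_prev = start[i] + compute
    return finish_prev


if __name__ == "__main__":
    n, latency, transfer, compute = 4, 10, 4, 4
    print("coupled:", coupled_cycles(n, latency, transfer, compute))
    for q in (1, 2, 3, 4):
        print("queue slots", q, "->",
              decoupled_cycles(n, latency, transfer, compute, q))
```

In this toy model, each additional queue slot lets the access side run further ahead of the execute side, and the gain from each extra slot shrinks once the port, rather than the queue, becomes the bottleneck. That is only a qualitative analogue of the queue-size tradeoff the abstract describes.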

Author information

Authors and Affiliations

Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona
Roger Espasa & Mateo Valero

About this article

Cite this article

Espasa, R., Valero, M. A Simulation Study of Decoupled Vector Architectures. The Journal of Supercomputing 14, 124–152 (1999). https://doi.org/10.1023/A:1008158808410

Issue Date: September 1999
DOI: https://doi.org/10.1023/A:1008158808410