Abstract
Decoupling techniques can be applied to a vector processor, yielding a large performance increase for vectorizable programs. We simulate a selection of programs from the Perfect Club and SPECfp92 benchmark suites and compare their execution time on a conventional single-port vector architecture with that on a decoupled vector architecture. Decoupling improves performance by a factor greater than 1.4 for realistic memory latencies; even with an ideal memory system with zero latency, the speedup is still as much as 1.3. A significant portion of this paper studies the tradeoffs involved in choosing suitable sizes for the queues of the decoupled architecture. The hardware cost of the queues need not be large to achieve most of the performance advantages of decoupling.
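To make the role of the queues concrete, here is a minimal sketch of a toy timing model. It is not the simulator used in the paper; the function name `run_cycles`, its parameters, and the numbers in the usage note are all hypothetical. The model assumes a single memory port, a fixed memory latency, and a queue whose slot is reserved when a vector load issues and released when the dependent computation finishes. With one slot the access and execute sides are effectively coupled; with more slots the access side can run ahead and overlap memory latency with computation.

```python
def run_cycles(n_ops, latency, load_time, compute_time, queue_slots):
    """Return total cycles for n_ops load+compute pairs in a toy model.

    Assumptions (illustrative only, not taken from the paper):
      - one memory port, busy for load_time cycles per vector load
      - data arrives `latency` cycles after the load leaves the port
      - a queue slot is held from load issue until its compute finishes
      - computes run in order, one at a time, for compute_time cycles each
    """
    if queue_slots < 1:
        raise ValueError("queue_slots must be at least 1")

    port_free = 0      # cycle at which the memory port is next available
    compute_end = []   # finish cycle of each compute, in program order

    for i in range(n_ops):
        # A load can issue only when a queue slot is free, i.e. once the
        # compute that is queue_slots positions earlier has finished.
        slot_free = compute_end[i - queue_slots] if i >= queue_slots else 0
        load_start = max(port_free, slot_free)
        port_free = load_start + load_time
        data_ready = port_free + latency

        # The compute waits for its data and for the previous compute.
        prev_end = compute_end[-1] if compute_end else 0
        compute_end.append(max(data_ready, prev_end) + compute_time)

    return compute_end[-1] if compute_end else 0


if __name__ == "__main__":
    coupled = run_cycles(100, latency=50, load_time=16, compute_time=16,
                         queue_slots=1)
    for slots in (1, 2, 4, 8):
        cycles = run_cycles(100, latency=50, load_time=16, compute_time=16,
                            queue_slots=slots)
        print(f"slots={slots} cycles={cycles} speedup={coupled / cycles:.2f}")
```

In this toy setting, cycle counts never increase as slots are added, and the model still shows a gain at zero latency because loads overlap with computation. The magnitudes it produces depend entirely on the made-up parameters and should not be compared with the speedups reported above.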
Cite this article
Espasa, R., Valero, M. A Simulation Study of Decoupled Vector Architectures. The Journal of Supercomputing 14, 124–152 (1999). https://doi.org/10.1023/A:1008158808410
DOI: https://doi.org/10.1023/A:1008158808410