Abstract.
Instruction-Level Parallelism (ILP) is the main source of performance in numerical applications. Architectural resources and program recurrences are the main limits on the ILP that can be exploited from loops, the most time-consuming part of numerical computations. To increase the issue rate, current designs replicate memory ports and functional units to a growing degree, but the high cost of this technique in power, area and clock cycle is making it less attractive.
Clustering is a popular technique for decentralizing the design of wide-issue cores so that they can meet technology constraints on cycle time, area and power. Another approach is to use wide functional units. Both techniques reduce the port requirements of the register file and the memory subsystem, but they impose scheduling constraints that may considerably reduce the exploitable ILP.
This paper evaluates several VLIW designs that use both techniques, analyzing power, area and performance on loops from the Perfect Club benchmark. From this study we conclude that applying clustering, widening, or both to the same core yields very power-efficient configurations with low area requirements.
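The claim that resources and recurrences bound the ILP of a loop can be illustrated with the standard lower bound on the initiation interval used in modulo scheduling. The sketch below is an illustration only and is not taken from the paper: the function names, the resource classes ("mem", "fp") and the sample numbers are all assumptions chosen for the example.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    # Resource bound: for each resource class, operations per iteration
    # divided by the number of units of that class (hypothetical classes).
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(recurrences):
    # Recurrence bound: for each dependence cycle, total latency divided by
    # its iteration distance. Each entry is a (latency, distance) pair.
    return max((ceil(lat / dist) for lat, dist in recurrences), default=0)

def mii(op_counts, unit_counts, recurrences):
    # Minimum initiation interval: the larger of the two bounds, at least 1.
    return max(1, res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Assumed loop body: 4 memory operations and 6 floating-point operations,
# with one recurrence of latency 4 cycles spanning 1 iteration.
ops = {"mem": 4, "fp": 6}
recs = [(4, 1)]

narrow = mii(ops, {"mem": 2, "fp": 2}, recs)  # resource bound 3, recurrence bound 4
wide = mii(ops, {"mem": 4, "fp": 4}, recs)    # resource bound 2, recurrence bound 4
print(narrow, wide)
```

In this assumed example, doubling the memory ports and functional units lowers the resource bound from 3 to 2 cycles, but the recurrence keeps the initiation interval at 4 cycles, so the added resources do not help this particular loop.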
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pericás, M., Ayguadé, E., Zalamea, J., Llosa, J., Valero, M. (2003). Power-Performance Trade-Offs in Wide and Clustered VLIW Cores for Numerical Codes. In: Veidenbaum, A., Joe, K., Amano, H., Aiso, H. (eds) High Performance Computing. ISHPC 2003. Lecture Notes in Computer Science, vol 2858. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39707-6_9
DOI: https://doi.org/10.1007/978-3-540-39707-6_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20359-9
Online ISBN: 978-3-540-39707-6
eBook Packages: Springer Book Archive