research-article

Rapid, low-power loop execution in a network of functional units

Authors:

Athanassios Tziouvaras,

Georgios Dimitriou

PCI '13: Proceedings of the 17th Panhellenic Conference on Informatics

Pages 211 - 218

https://doi.org/10.1145/2491845.2491859

Published: 19 September 2013

Abstract

The need for high-performance computing and low-power operation has led to new processor architectures, with most recent designs combining multiple cores and multiple threads per core. In our work, we explore an architecture of multiple instruction pipelines that merge into a common back end, formed as a network of functional units. This paper focuses on the back end, and in particular on rapid, low-power execution of loops based on dataflow. We dispatch the loop body instructions onto the network of functional units only once, and then let the loop execute in a dataflow manner, with no further instruction issue until the loop completes. In this way, we not only speed up loop execution but also save energy, since the entire front end of the pipeline is unused during loop execution and can be turned off. We have simulated the functional unit network at the microarchitecture level, running a number of Livermore loops. The results show that the proposed architecture can accelerate loop execution by up to N/k, for a network of N units, a loop body of N instructions, and an issue rate of k instructions per cycle.
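The N/k bound can be illustrated with a small cycle-count model. This is only a sketch under assumptions that the abstract does not state: the conventional core is limited solely by its issue rate and must re-issue all N body instructions on every iteration, while the functional-unit network pays the dispatch cost once and then, in the ideal case, completes one iteration per cycle. The function names and the one-iteration-per-cycle steady state are hypothetical, chosen for illustration, and are not taken from the paper's simulator.

```python
from math import ceil


def conventional_cycles(n_body: int, k_issue: int, iterations: int) -> int:
    """Cycles for a core that re-issues the loop body on every iteration.

    Assumption: the only limit is the issue rate of k_issue instructions
    per cycle, so each iteration costs ceil(n_body / k_issue) cycles.
    """
    return iterations * ceil(n_body / k_issue)


def dataflow_cycles(n_body: int, k_issue: int, iterations: int) -> int:
    """Cycles when the loop body is dispatched to the network only once.

    Assumption (idealized): after the one-time dispatch, the network
    completes one loop iteration per cycle with no further instruction issue.
    """
    dispatch = ceil(n_body / k_issue)  # one-time cost of placing the body
    return dispatch + iterations


def speedup(n_body: int, k_issue: int, iterations: int) -> float:
    """Ratio of conventional to dataflow cycle counts under the model above."""
    return conventional_cycles(n_body, k_issue, iterations) / dataflow_cycles(
        n_body, k_issue, iterations
    )


if __name__ == "__main__":
    # Hypothetical configuration: 16 units, 16-instruction body, 4-wide issue.
    for iters in (10, 1_000, 1_000_000):
        print(iters, round(speedup(16, 4, iters), 4))
```

Under these assumptions the speedup stays below N/k and approaches it as the iteration count grows, because the one-time dispatch cost is amortized over more iterations. Real loops with long dependence chains or memory stalls would fall short of this ideal, which is consistent with the abstract's "up to" wording.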

Cited By

Kalaitzidis K., Dimitriou G., Stamoulis G., Dossis M., George G., Stefanos G., Lazaros M., Panagiotis T., Cleo S. (2015). Performance and power simulation of a functional-unit-network processor with simplescalar and wattch. Proceedings of the 19th Panhellenic Conference on Informatics, 71-76. DOI: 10.1145/2801948.2801958. Online publication date: 1-Oct-2015.
https://dl.acm.org/doi/10.1145/2801948.2801958

Index Terms

Rapid, low-power loop execution in a network of functional units
1. Applied computing
  1. Computers in other domains
    1. Personal computers and PC applications
      1. Microcomputers
2. Computer systems organization
  1. Architectures


Published In

PCI '13: Proceedings of the 17th Panhellenic Conference on Informatics

September 2013

359 pages

ISBN:9781450319690

DOI:10.1145/2491845

General Chairs:
Panayiotis H. Ketikidis, Greek Computer Society
Kostas Margaritis, University of Macedonia
Ioannis Vlahavas, Aristotle University of Thessaloniki

Program Chairs:
Alexandros Chatzigeorgiou, University of Macedonia
George Eleftherakis, The University of Sheffield
Ioannis Stamelos, Aristotle University of Thessaloniki

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

University of Macedonia
Aristotle University of Thessaloniki
The University of Sheffield
Alexander TEI of Thessaloniki

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PCI 2013

Sponsor:

The University of Sheffield

PCI 2013: 17th Panhellenic Conference on Informatics

September 19 - 21, 2013

Thessaloniki, Greece

Acceptance Rates

Overall Acceptance Rate 190 of 390 submissions, 49%
