Article

Novel architecture for loop acceleration: a case study

Authors: Sri Parameswaran, Newton Cheung

CODES+ISSS '05: Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis

Pages 297 - 302

https://doi.org/10.1145/1084834.1084908

Published: 19 September 2005

Abstract

In this paper, we show a novel approach to accelerating loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it in a coprocessor. We contrast the acceleration obtained by implementing the critical segment as two different coprocessors and as a set of customized instructions. The two coprocessor approaches are a high-level synthesis (HLS) approach and a custom coprocessor approach. The HLS approach provides a faster method of generating coprocessors. Relative to the main processor alone, we show a loop performance improvement of 2.57x using the custom coprocessor approach, compared to 1.58x for the HLS approach and 1.33x for the customized instruction approach. The respective energy savings within the loop are 57%, 28% and 19%.
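To make the reported figures easier to compare, the sketch below converts each loop speedup into the fraction of baseline loop time that remains, and each energy saving into the fraction of baseline loop energy that remains. Only the six numbers come from the abstract. The dictionary layout and the function names are illustrative assumptions, as is the reading that both the speedups and the energy savings are measured against the main processor alone.

```python
# Figures quoted in the abstract. The data layout and helper names are
# illustrative assumptions and do not come from the paper itself.
LOOP_RESULTS = {
    "custom coprocessor": {"speedup": 2.57, "energy_saving": 0.57},
    "HLS coprocessor": {"speedup": 1.58, "energy_saving": 0.28},
    "customized instructions": {"speedup": 1.33, "energy_saving": 0.19},
}


def relative_loop_time(speedup: float) -> float:
    """Loop execution time as a fraction of the baseline loop time."""
    return 1.0 / speedup


def remaining_energy(saving: float) -> float:
    """Loop energy as a fraction of the baseline loop energy."""
    return 1.0 - saving


def summarize(results: dict) -> list:
    """Return (approach, relative loop time, remaining energy) tuples."""
    rows = []
    for name, figures in results.items():
        rows.append(
            (
                name,
                round(relative_loop_time(figures["speedup"]), 3),
                round(remaining_energy(figures["energy_saving"]), 2),
            )
        )
    return rows


if __name__ == "__main__":
    for name, time_fraction, energy_fraction in summarize(LOOP_RESULTS):
        print(
            f"{name}: loop time {time_fraction:.1%} of baseline, "
            f"loop energy {energy_fraction:.0%} of baseline"
        )
```

Under these assumptions, the 2.57x speedup of the custom coprocessor corresponds to the loop taking roughly 39% of its baseline time. These fractions apply to the accelerated loop only, not to the whole JPEG encoder.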

Index Terms

Novel architecture for loop acceleration: a case study
1. Computer systems organization
  1. Architectures
    1. Other architectures

Published In

CODES+ISSS '05: Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis

September 2005

356 pages

ISBN:1595931619

DOI:10.1145/1084834

General Chairs: Petru Eles (Linköping University, Sweden), Axel Jantsch (Royal Institute of Technology, Sweden)

Program Chair: Reinaldo Bergamaschi (IBM T. J. Watson Research Center)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CODES/ISSS05: Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis

September 19 - 21, 2005

Jersey City, NJ, USA

Acceptance Rates

CODES+ISSS '05 Paper Acceptance Rate 50 of 200 submissions, 25%;

Overall Acceptance Rate 280 of 864 submissions, 32%

Cited By

Si Q, Carrion Schafer B (2024). MOSAIC: Maximizing ResOurce Sharing in Behavioral Application SpecIfic ProCessors. Microprocessors and Microsystems, 106 (105039). Online publication date: Apr-2024.
https://doi.org/10.1016/j.micpro.2024.105039

Kalaitzidis K, Dimitriou G, Stamoulis G, Dossis M, George G, Stefanos G, Lazaros M, Panagiotis T, Cleo S (2015). Performance and power simulation of a functional-unit-network processor with simplescalar and wattch. Proceedings of the 19th Panhellenic Conference on Informatics (71-76). Online publication date: 1-Oct-2015.
https://dl.acm.org/doi/10.1145/2801948.2801958

Tziouvaras A, Dimitriou G, Ketikidis P, Margaritis K, Vlahavas I, Chatzigeorgiou A, Eleftherakis G, Stamelos I (2013). Rapid, low-power loop execution in a network of functional units. Proceedings of the 17th Panhellenic Conference on Informatics (211-218). Online publication date: 19-Sep-2013.
https://dl.acm.org/doi/10.1145/2491845.2491859

Yan L, Wang G, Chen T, Chow P, Cheung P (2009). The input-aware dynamic adaptation of area and performance for reconfigurable accelerator. Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays (281-281). Online publication date: 24-Feb-2009.
https://dl.acm.org/doi/10.1145/1508128.1508191

Zuluaga M, Kluter T, Brisk P, Topham N, Ienne P (2009). Introducing control-flow inclusion to support pipelining in custom instruction set extensions. 2009 IEEE 7th Symposium on Application Specific Processors (114-121). Online publication date: Jul-2009.
https://doi.org/10.1109/SASP.2009.5226328

Galanis M, Dimitroulakos G, Tragoudas S, Goutis C (2008). Speedups in embedded systems with a high-performance coprocessor datapath. ACM Transactions on Design Automation of Electronic Systems, 12:3 (1-22). Online publication date: 22-May-2008.
https://dl.acm.org/doi/10.1145/1255456.1255472

Galanis M, Dimitroulakos G, Goutis C (2008). Performance and energy consumption improvements in microprocessor systems utilizing a coprocessor data-path. Journal of Signal Processing Systems, 50:2 (179-200). Online publication date: 1-Feb-2008.
https://dl.acm.org/doi/10.1007/s11265-007-0097-y

Galanis M, Dimitroulakos G, Goutis C, Zhou H, Macii E, Yan Z, Massoud Y (2007). Improving performance and energy consumption in embedded microprocessor platforms with a flexible custom coprocessor data-path. Proceedings of the 17th ACM Great Lakes symposium on VLSI (2-7). Online publication date: 11-Mar-2007.
https://dl.acm.org/doi/10.1145/1228784.1228792

Galanis M, Dimitroulakos G, Goutis C (2007). Exploring the speedups of embedded microprocessor systems utilizing a high-performance coprocessor data-path. The Journal of Supercomputing, 39:3 (251-271). Online publication date: 1-Mar-2007.
https://dl.acm.org/doi/10.1007/s11227-006-0007-2

Galanis M, Dimitroulakos G, Goutis C (2006). Performance Improvements in Microprocessor Systems Utilizing a Coprocessor Data-Path. 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (85-92). Online publication date: Jul-2006.
https://doi.org/10.1109/ICSAMOS.2006.300813
