DOI: 10.1145/1084834.1084908

Novel architecture for loop acceleration: a case study

Published: 19 September 2005

Abstract

In this paper, we show a novel approach to accelerating loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it in a coprocessor. For comparison, we implement the critical segment in three ways: as two different coprocessors and as a set of customized instructions. The two coprocessor approaches are a high-level synthesis (HLS) approach and a custom coprocessor approach. The HLS approach provides a faster method of generating coprocessors. We show that the custom coprocessor approach achieves a loop performance improvement of 2.57x, compared with 1.58x for the HLS approach and 1.33x for the customized instruction approach, all relative to the main processor alone. The respective energy savings within the loop are 57%, 28% and 19%.
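As a rough reading aid, the reported figures can be restated as the fraction of baseline loop time and loop energy that remains under each approach. The sketch below is only arithmetic on the numbers quoted in the abstract. The names `REPORTED`, `relative_loop_time`, `relative_loop_energy` and `summarize` are hypothetical and do not come from the paper.

```python
# Figures quoted in the abstract: (loop speedup, loop energy saving),
# both relative to running the loop on the main processor alone.
# The dictionary and function names are illustrative assumptions.
REPORTED = {
    "custom coprocessor": (2.57, 0.57),
    "HLS coprocessor": (1.58, 0.28),
    "customized instructions": (1.33, 0.19),
}


def relative_loop_time(speedup: float) -> float:
    """Fraction of the baseline loop time remaining after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def relative_loop_energy(saving: float) -> float:
    """Fraction of the baseline loop energy remaining after a given saving."""
    if not 0.0 <= saving <= 1.0:
        raise ValueError("saving must be a fraction between 0 and 1")
    return 1.0 - saving


def summarize(reported=REPORTED):
    """Map each approach to (relative loop time, relative loop energy)."""
    return {
        name: (
            round(relative_loop_time(speedup), 3),
            round(relative_loop_energy(saving), 2),
        )
        for name, (speedup, saving) in reported.items()
    }


if __name__ == "__main__":
    for name, (time_frac, energy_frac) in summarize().items():
        print(f"{name}: {time_frac:.1%} of baseline loop time, "
              f"{energy_frac:.0%} of baseline loop energy")
```

These fractions describe the accelerated loop only. The abstract does not state whole-application speedup or energy, so nothing here should be read as an end-to-end figure.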

    Published In

    CODES+ISSS '05: Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
    September 2005
    356 pages
    ISBN: 1595931619
    DOI: 10.1145/1084834

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ASIP
    2. architecture
    3. coprocessor
    4. hardware/software partitioning
    5. latency hiding
    6. loop acceleration
    7. loop optimization
    8. loop pipelining
    9. tightly coupled

    Qualifiers

    • Article

    Conference

    CODES/ISSS05

    Acceptance Rates

    CODES+ISSS '05 Paper Acceptance Rate 50 of 200 submissions, 25%;
    Overall Acceptance Rate 280 of 864 submissions, 32%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (last 12 months): 4
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 17 Jan 2025

    Cited By

    • (2024) MOSAIC: Maximizing ResOurce Sharing in Behavioral Application SpecIfic ProCessors. Microprocessors and Microsystems 106 (105039). DOI: 10.1016/j.micpro.2024.105039. Online publication date: Apr-2024
    • (2015) Performance and power simulation of a functional-unit-network processor with simplescalar and wattch. Proceedings of the 19th Panhellenic Conference on Informatics (71-76). DOI: 10.1145/2801948.2801958. Online publication date: 1-Oct-2015
    • (2013) Rapid, low-power loop execution in a network of functional units. Proceedings of the 17th Panhellenic Conference on Informatics (211-218). DOI: 10.1145/2491845.2491859. Online publication date: 19-Sep-2013
    • (2009) The input-aware dynamic adaptation of area and performance for reconfigurable accelerator. Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays (281-281). DOI: 10.1145/1508128.1508191. Online publication date: 24-Feb-2009
    • (2009) Introducing control-flow inclusion to support pipelining in custom instruction set extensions. 2009 IEEE 7th Symposium on Application Specific Processors (114-121). DOI: 10.1109/SASP.2009.5226328. Online publication date: Jul-2009
    • (2008) Speedups in embedded systems with a high-performance coprocessor datapath. ACM Transactions on Design Automation of Electronic Systems 12:3 (1-22). DOI: 10.1145/1255456.1255472. Online publication date: 22-May-2008
    • (2008) Performance and energy consumption improvements in microprocessor systems utilizing a coprocessor data-path. Journal of Signal Processing Systems 50:2 (179-200). DOI: 10.1007/s11265-007-0097-y. Online publication date: 1-Feb-2008
    • (2007) Improving performance and energy consumption in embedded microprocessor platforms with a flexible custom coprocessor data-path. Proceedings of the 17th ACM Great Lakes symposium on VLSI (2-7). DOI: 10.1145/1228784.1228792. Online publication date: 11-Mar-2007
    • (2007) Exploring the speedups of embedded microprocessor systems utilizing a high-performance coprocessor data-path. The Journal of Supercomputing 39:3 (251-271). DOI: 10.1007/s11227-006-0007-2. Online publication date: 1-Mar-2007
    • (2006) Performance Improvements in Microprocessor Systems Utilizing a Coprocessor Data-Path. 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (85-92). DOI: 10.1109/ICSAMOS.2006.300813. Online publication date: Jul-2006
