Customized pipeline and instruction set architecture for embedded processing engines

Yazdanbakhsh, Amir; Salehi, Mostafa E.; Fakhraie, Sied Mehdi

doi:10.1007/s11227-013-1075-8

Published: 06 February 2014

Volume 68, pages 948–977, (2014)

The Journal of Supercomputing

Abstract

Custom instructions can improve the execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous registerfile accesses, and larger registerfiles are more power hungry and require complex forwarding interconnects. Therefore, the limited number of ports on the base processor's registerfile generally limits the size and efficiency of custom instructions. Recent research has focused on overcoming this limitation through innovative architectural techniques supplemented with customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available registerfile data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose registerfile accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. Although we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.
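
The port-bandwidth argument in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not the authors' model: it assumes a hypothetical registerfile with a fixed number of read ports, and it assumes that each word of a multi-word instruction gets one register-read slot in consecutive pipeline cycles. The function names `read_cycles` and `max_operands` are invented for this illustration.

```python
from math import ceil


def read_cycles(num_operands: int, read_ports: int) -> int:
    """Cycles needed to fetch all source operands from a registerfile
    that can serve `read_ports` reads per cycle."""
    if read_ports <= 0:
        raise ValueError("read_ports must be positive")
    if num_operands < 0:
        raise ValueError("num_operands must be non-negative")
    return ceil(num_operands / read_ports)


def max_operands(read_ports: int, instruction_words: int) -> int:
    """Source operands an instruction can read without extra read cycles.

    ASSUMPTION (for illustration only): every instruction word occupies
    one register-read slot, so an instruction spanning `instruction_words`
    words can use the read ports for that many consecutive cycles.
    """
    if read_ports <= 0 or instruction_words <= 0:
        raise ValueError("arguments must be positive")
    return read_ports * instruction_words


if __name__ == "__main__":
    ports = 2  # hypothetical base registerfile with two read ports
    for words in (1, 2):
        print(f"{words}-word instruction: up to "
              f"{max_operands(ports, words)} source operands")
    print("read cycles for a 4-input custom instruction:",
          read_cycles(4, ports))
```

Under these assumptions, a double-word instruction on a two-read-port registerfile can source twice as many operands as a single-word instruction without adding ports, which is the kind of bandwidth gain that overlapping registerfile accesses in the pipeline aims for.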

Acknowledgments

The authors acknowledge the partial support received for this work under contract 149811/140 from Microelectronic Committee-Research Administration of the University of Tehran.

Author information

Authors and Affiliations

Nano Electronics Center of Excellence, University of Tehran, 14395-515, Tehran, Islamic Republic of Iran
Amir Yazdanbakhsh, Mostafa E. Salehi & Sied Mehdi Fakhraie

Corresponding author

Correspondence to Mostafa E. Salehi.

About this article

Cite this article

Yazdanbakhsh, A., Salehi, M.E. & Fakhraie, S.M. Customized pipeline and instruction set architecture for embedded processing engines. J Supercomput 68, 948–977 (2014). https://doi.org/10.1007/s11227-013-1075-8

Issue Date: May 2014
DOI: https://doi.org/10.1007/s11227-013-1075-8
