Abstract
Custom instructions can improve the execution speed and code compression of embedded applications. However, more efficient custom instructions require a higher number of simultaneous register file accesses. Larger register files are more power hungry and require more complex forwarding interconnects. The limited number of ports on the base processor's register file therefore generally limits the size and efficiency of custom instructions. Recent research has focused on overcoming this limitation through innovative architectural techniques supplemented with customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register file data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose register file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.
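The port constraint described above can be illustrated with a small back-of-the-envelope sketch. It is not the authors' model: it assumes a hypothetical base processor with two register file read ports, and it assumes that each instruction word gets exactly one register-read slot in the pipeline, so that a double-word instruction can read twice as many operands without adding ports. The function names are invented for illustration.

```python
from math import ceil


def read_cycles(num_operands: int, read_ports: int) -> int:
    """Register-read slots needed to fetch all source operands
    when at most `read_ports` registers can be read per slot."""
    if num_operands < 0 or read_ports <= 0:
        raise ValueError("num_operands must be >= 0 and read_ports > 0")
    return ceil(num_operands / read_ports)


def max_operands(instruction_words: int, read_ports: int) -> int:
    """Operands an instruction can read if each of its words gets one
    register-read slot (assumption made for this sketch only)."""
    if instruction_words <= 0 or read_ports <= 0:
        raise ValueError("instruction_words and read_ports must be > 0")
    return instruction_words * read_ports


# Hypothetical base processor: 2 read ports (assumed, not from the paper).
PORTS = 2
single_word_limit = max_operands(1, PORTS)  # operands per single-word instruction
double_word_limit = max_operands(2, PORTS)  # operands per double-word instruction
slots_for_four_inputs = read_cycles(4, PORTS)  # read slots for a 4-input instruction
```

Under these assumptions, a four-input custom instruction needs two read slots, which a double-word encoding provides by spreading its register accesses over consecutive pipeline slots rather than by widening the register file.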
Acknowledgments
The authors acknowledge the partial support received for this work under contract 149811/140 from Microelectronic Committee-Research Administration of the University of Tehran.
Yazdanbakhsh, A., Salehi, M.E. & Fakhraie, S.M. Customized pipeline and instruction set architecture for embedded processing engines. J Supercomput 68, 948–977 (2014). https://doi.org/10.1007/s11227-013-1075-8
DOI: https://doi.org/10.1007/s11227-013-1075-8