Customized pipeline and instruction set architecture for embedded processing engines

The Journal of Supercomputing

Abstract

Custom instructions can improve the execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous register file accesses, and larger register files are more power hungry and require complex forwarding interconnects. The limited number of ports on the base processor's register file therefore generally limits the size and efficiency of custom instructions. Recent research has focused on overcoming this limitation through innovative architectural techniques supplemented with customized compilation. However, to the best of our knowledge, few studies take into account complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register file data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose register file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.
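
The port limitation described in the abstract can be made concrete with a small arithmetic sketch. This is a toy model only, not the authors' architecture: the function names, the two-read-port register file, and the assumption that each word of a multi-word custom instruction gets one register-read cycle are all hypothetical, chosen to illustrate why spreading accesses over pipeline cycles raises the number of operands an instruction can reach.

```python
from math import ceil


def read_cycles(num_operands: int, read_ports: int) -> int:
    """Register-read cycles needed to fetch all source operands."""
    if read_ports <= 0:
        raise ValueError("read_ports must be positive")
    if num_operands < 0:
        raise ValueError("num_operands must be non-negative")
    return ceil(num_operands / read_ports)


def operands_per_instruction(instruction_words: int, read_ports: int) -> int:
    """Source operands reachable by one instruction.

    Assumption (not taken from the paper): every word of a multi-word
    custom instruction occupies the register-read stage for one cycle,
    so the reads of consecutive words can be overlapped in the pipeline.
    """
    if instruction_words <= 0 or read_ports <= 0:
        raise ValueError("arguments must be positive")
    return instruction_words * read_ports


if __name__ == "__main__":
    ports = 2  # hypothetical base register file with two read ports
    for words in (1, 2):
        reachable = operands_per_instruction(words, ports)
        print(f"{words}-word instruction: up to {reachable} source operands")
```

Under these assumptions, a single-word instruction is capped at the port count, while a double-word instruction reaches twice as many operands without adding ports. The hazards that such overlapping introduces are outside the scope of this sketch.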

Acknowledgments

The authors acknowledge the partial support received for this work under contract 149811/140 from Microelectronic Committee-Research Administration of the University of Tehran.

Author information

Corresponding author

Correspondence to Mostafa E. Salehi.

About this article

Cite this article

Yazdanbakhsh, A., Salehi, M.E. & Fakhraie, S.M. Customized pipeline and instruction set architecture for embedded processing engines. J Supercomput 68, 948–977 (2014). https://doi.org/10.1007/s11227-013-1075-8

  • DOI: https://doi.org/10.1007/s11227-013-1075-8
