Abstract
Distributed loop buffer architectures with incompatible loop-nest organisations allow incompatible loops to execute in parallel with minimal hardware overhead. These distributed and scalable architectures are therefore a promising option for improving the energy efficiency of the instruction memory organisations in embedded systems. This paper proposes and analyses non-overlapping, complementary implementation options for distinct partitions of the design space of distributed loop buffer architectures. The high-level trade-off analysis of the proposed implementations is crucial for presenting the design process that an embedded systems designer has to follow to obtain an efficient distributed loop buffer architecture for a given application. Results show that, at the cost of an increase of about 6.5 % in the energy consumption of the control logic of the instruction memory organisation, the overall energy consumption of the instruction memory organisation can be reduced by 6 % to 22 % when distributed loop buffer architectures with incompatible loop-nest organisations are used instead of clustered loop buffer architectures with shared loop-nest organisations.













Acknowledgments
This work is supported by the Spanish Ministry of Science and Innovation, under grant BES-2009-023681, and the Spanish Ministry of Economy and Competitiveness, under grant TEC2012-33892.
Cite this article
Artes, A., Fasthuber, R., Ayala, J.L. et al. Design Space Exploration of Distributed Loop Buffer Architectures with Incompatible Loop-Nest Organisations in Embedded Systems. J Sign Process Syst 72, 69–85 (2013). https://doi.org/10.1007/s11265-013-0749-z
DOI: https://doi.org/10.1007/s11265-013-0749-z