
Address Generation Optimization for Embedded High-Performance Processors: A Survey

Journal of Signal Processing Systems

Abstract

Embedded systems are growing at an impressive rate and support increasingly sophisticated applications characterized by complex array index manipulation and a large number of data accesses. These applications require high-performance, application-specific computation that general-purpose processors cannot deliver at a reasonable energy cost. Very long instruction word (VLIW) architectures appear to be a good solution: they provide enough computational performance at low power, together with the programmability needed to shorten time to market. These architectures rely on the compiler to exploit the available instruction and data parallelism and keep the data path busy at all times. With transistor density doubling roughly every 18 months, increasingly sophisticated architectures with a large number of computational resources running in parallel are emerging. As parallel computation increases, data access is becoming the main bottleneck that limits the exploitable parallelism. To alleviate this problem, current embedded architectures include a special unit, the address generator unit (AGU), which works in parallel with the main computing elements to ensure that data are fed and stored efficiently; address generators come in many flavors. Future architectures will have to handle enormous memory bandwidth across distributed memories, and the development of address generator units will be crucial for an effective next generation of embedded processors, in which global trade-offs between reaction time, bandwidth, energy and area must be achieved. This paper surveys methods and techniques that optimize the address generation process for embedded systems, and outlines current research trends and future needs.
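The central idea above, a dedicated unit that computes addresses alongside the data path instead of spending datapath cycles on index arithmetic, can be illustrated with a small software model. The sketch below is hypothetical and not taken from the paper: `AddressRegister`, `post_modify`, `column_walk_explicit` and `column_walk_agu` are illustrative names, and it assumes an AGU offering post-modify and modulo (circular-buffer) addressing, two features commonly found in DSP address generators. It shows how a multiply-based index expression such as `i * cols + j` can be reduced to a sequence of additions that such a unit could perform in parallel with computation.

```python
from dataclasses import dataclass


@dataclass
class AddressRegister:
    """Hypothetical model of one AGU address register.

    Supports post-modify updates and optional modulo (circular-buffer)
    addressing. Names and behaviour are illustrative assumptions, not
    the instruction set of any specific processor.
    """
    base: int         # first address of the buffer
    length: int = 0   # buffer length in words; 0 disables modulo wrap-around
    offset: int = 0   # current position relative to base

    def post_modify(self, step: int) -> int:
        """Return the current address, then advance the offset by `step`."""
        addr = self.base + self.offset
        self.offset += step
        if self.length:
            self.offset %= self.length  # wrap around like a circular buffer
        return addr


def column_walk_explicit(base: int, rows: int, cols: int) -> list[int]:
    """Column-by-column walk of a row-major array: one multiply per access."""
    return [base + i * cols + j for j in range(cols) for i in range(rows)]


def column_walk_agu(base: int, rows: int, cols: int) -> list[int]:
    """Same walk using only additions (strength reduction of i * cols + j),
    the kind of update an AGU could run in parallel with the data path."""
    ar = AddressRegister(base=base)
    addrs = []
    for _ in range(cols):
        for i in range(rows):
            # Move one row down; after the last row, jump back up to the
            # top of the next column.
            step = cols if i < rows - 1 else 1 - (rows - 1) * cols
            addrs.append(ar.post_modify(step))
    return addrs


# Both address sequences are identical; only the arithmetic differs.
print(column_walk_agu(0, 3, 4))  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]

# A 4-word circular delay line at 0x100, as used for example in an FIR filter.
delay = AddressRegister(base=0x100, length=4)
print([hex(delay.post_modify(1)) for _ in range(6)])
# ['0x100', '0x101', '0x102', '0x103', '0x100', '0x101']
```

In hardware, the additions and the modulo wrap in `post_modify` would be carried out by the address generator alongside the data path; the Python model only demonstrates that the strength-reduced sequence matches the explicit index computation.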

Figures 1–9



Author information

Corresponding author

Correspondence to Guillermo Talavera.


About this article

Cite this article

Talavera, G., Jayapala, M., Carrabina, J. et al. Address Generation Optimization for Embedded High-Performance Processors: A Survey. J Sign Process Syst Sign Image Video Technol 53, 271–284 (2008). https://doi.org/10.1007/s11265-008-0165-y


  • DOI: https://doi.org/10.1007/s11265-008-0165-y
