Abstract
Embedded systems are growing at an impressive rate and run increasingly sophisticated applications, characterized by complex array index manipulation and a large number of data accesses. These applications require high-performance, application-specific computation that general-purpose processors cannot deliver at reasonable energy consumption. Very long instruction word (VLIW) architectures appear to be a good solution: they provide sufficient computational performance at low power while retaining the programmability needed to shorten time to market. These architectures rely on the compiler to exploit the available instruction and data parallelism and keep the data path busy at all times. With transistor density doubling every 18 months, increasingly sophisticated architectures with a large number of computational resources running in parallel are emerging. As parallel computation increases, access to data is becoming the main bottleneck limiting the available parallelism. To alleviate this problem, current embedded architectures include a special unit that works in parallel with the main computing elements to ensure that data are fed and stored efficiently: the address generator unit, which comes in many flavors. Future architectures will have to sustain enormous memory bandwidth across distributed memories, and the development of address generator units will be crucial for an effective next generation of embedded processors, in which global trade-offs between reaction time, bandwidth, energy, and area must be achieved. This paper surveys methods and techniques that optimize the address generation process for embedded systems and explains current research trends and future needs.
Cite this article
Talavera, G., Jayapala, M., Carrabina, J. et al. Address Generation Optimization for Embedded High-Performance Processors: A Survey. J Sign Process Syst Sign Image Video Technol 53, 271–284 (2008). https://doi.org/10.1007/s11265-008-0165-y
DOI: https://doi.org/10.1007/s11265-008-0165-y