A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance

Journal of Signal Processing Systems

Abstract

A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of on-chip parallel processing. The architecture is based on a superscalar processor model with a modified Tomasulo scheme, extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads (simultaneously multi-threaded, SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks of the processor, such as the register file and the scheduling window, and leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files, and memories are distributed across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative comparison of destination tags in a centralized instruction scheduler with explicit operand transfer instructions, thus decentralizing control of the data flow to the greatest possible extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. The performance of the architecture therefore scales with both the number of function units and the number of thread units without any impact on the processor's cycle time. Performance and scalability of the proposed microarchitecture are demonstrated on a cycle-true simulator with critical signal processing kernels from the MPEG-4 video coding standard.
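
The contrast between tag broadcasting and explicit operand transfer can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions, not the protocol defined in the article: the names Entry, broadcast_wakeup, and explicit_transfer, and the (cluster, slot, operand index) addressing are hypothetical and chosen purely for illustration. It shows that a broadcast touches every entry of the scheduling window, while an explicit transfer touches only the addressed consumer.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical scheduling-window entry waiting for two source operands."""
    op: str
    operands: list = field(default_factory=lambda: [None, None])
    # Source tags are used only by the broadcast (centralized) model.
    tags: list = field(default_factory=lambda: [None, None])

    def ready(self):
        """An entry may issue once both operands have arrived."""
        return all(v is not None for v in self.operands)


def broadcast_wakeup(window, tag, value):
    """Centralized model: every entry compares the broadcast tag with its sources.

    Returns the number of tag comparisons, which grows with the window size.
    """
    compares = 0
    for entry in window:
        for i, t in enumerate(entry.tags):
            compares += 1
            if t == tag:
                entry.operands[i] = value
    return compares


def explicit_transfer(clusters, dest, value):
    """Distributed model (assumed addressing): the producer names its consumer.

    dest is a (cluster, slot, operand index) triple. Only the addressed entry
    is written, so no associative comparison takes place.
    """
    cluster, slot, idx = dest
    clusters[cluster][slot].operands[idx] = value
    return 1  # a single point-to-point write
```

In this sketch, waking up an eight-entry window by broadcast costs one comparison per source tag per entry, whereas explicit_transfer(clusters, (1, 0, 0), 5) writes exactly one operand slot regardless of how many clusters exist. That constant cost is the property the abstract relies on when it argues that cycle time stays independent of issue and execution bandwidth.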


Author information

Corresponding author

Correspondence to Mladen Berekovic.

About this article

Cite this article

Berekovic, M. & Niggemeier, T. A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance. J Sign Process Syst Sign Image 50, 201–229 (2008). https://doi.org/10.1007/s11265-007-0138-6
