A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance

Berekovic, Mladen; Berekovic, Mladen; Niggemeier, Tim

doi:10.1007/s11265-007-0138-6

Published: 04 October 2007

Volume 50, pages 201–229, (2008)

Journal of Signal Processing Systems

Mladen Berekovic¹,
Mladen Berekovic² &
Tim Niggemeier³


Abstract

A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of on-chip parallel processing. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads (simultaneous multi-threading, SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks of the processor, such as the register file and the scheduling window. This leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files and memories are distributed across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative comparison of destination tags in a centralized instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest possible extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. The performance of the architecture therefore scales with both the number of functional units and the number of thread units without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.
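The contrast between tag broadcasting and explicit operand transfers can be illustrated with a toy model. The sketch below is not the protocol from the article; the `Entry` class, the function names and the tag strings are all hypothetical, and it only counts how much work one result delivery causes under each scheme. With broadcasting, every entry of the scheduling window compares the tag, so the cost grows with the window size. With explicit transfers, the producer names its consumers, so the cost grows only with the number of consumers.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One waiting instruction in a scheduling window (hypothetical model)."""
    name: str
    waiting_on: set = field(default_factory=set)  # producer tags still outstanding

    @property
    def ready(self):
        return not self.waiting_on


def broadcast_wakeup(window, tag):
    """Centralized style: the result tag is broadcast to the whole window.

    Every entry performs one associative compare, whether or not it
    actually needs the result. Returns the number of compares.
    """
    compares = 0
    for entry in window:
        compares += 1
        entry.waiting_on.discard(tag)
    return compares


def explicit_transfer(window, tag, destinations):
    """Distributed style: the producer sends the operand to named consumers.

    Only the listed destinations are touched. Returns the number of
    point-to-point messages.
    """
    index = {entry.name: entry for entry in window}
    messages = 0
    for dest in destinations:
        messages += 1
        index[dest].waiting_on.discard(tag)
    return messages


def make_window(size):
    """Build a window in which only i0 and i1 wait on tag 't1'."""
    window = [Entry(f"i{k}", {"other"}) for k in range(size)]
    window[0].waiting_on = {"t1"}
    window[1].waiting_on = {"t1"}
    return window
```

In this model, delivering `t1` to a 16-entry window costs 16 compares with `broadcast_wakeup` but only 2 messages with `explicit_transfer(window, "t1", ["i0", "i1"])`. This is the sense in which the cost of operand delivery no longer depends on the window size.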



Author information

Authors and Affiliations

IMEC, Eindhoven, The Netherlands
Mladen Berekovic
TU Delft, Delft, The Netherlands
Mladen Berekovic
IBM Deutschland Entwicklung GmbH, Böblingen, Germany
Tim Niggemeier

Corresponding author

Correspondence to Mladen Berekovic.

About this article

Cite this article

Berekovic, M., Berekovic, M. & Niggemeier, T. A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance. J Sign Process Syst Sign Image 50, 201–229 (2008). https://doi.org/10.1007/s11265-007-0138-6

Received: 18 February 2007
Revised: 24 April 2007
Accepted: 09 August 2007
Published: 04 October 2007
Issue Date: February 2008
DOI: https://doi.org/10.1007/s11265-007-0138-6
