Abstract
A scalable, distributed processor architecture is presented that targets high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of on-chip parallel processing. The architecture is based on a superscalar processor model with a modified Tomasulo scheme, extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads (simultaneously multi-threaded, SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks of the processor, such as the register file and the scheduling window, and leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files, and memories are distributed across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative comparison of destination tags in a centralized instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest possible extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. The performance of the architecture therefore scales with both the number of function units and the number of thread units without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated on a cycle-true simulator with critical signal processing kernels from the MPEG-4 video coding standard.
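The difference between tag broadcast and explicit operand transfer can be illustrated with a toy model. The sketch below is not the protocol described in the article; the names `Entry`, `broadcast_wakeup`, and `explicit_transfer`, and the four-entry window, are hypothetical and chosen only to show why the amount of work per result stops depending on the window size once the producer names its consumers.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One waiting instruction (hypothetical model, not the paper's format)."""
    name: str
    src_tags: list                      # tags of the results this entry waits for
    values: dict = field(default_factory=dict)

    def ready(self):
        # Ready once every awaited source tag has received a value.
        return len(self.values) == len(self.src_tags)


def broadcast_wakeup(window, tag, value):
    """Centralized style: compare the result tag against every source slot."""
    compares = 0
    for entry in window:
        for src in entry.src_tags:
            compares += 1               # one associative compare per slot
            if src == tag:
                entry.values[tag] = value
    return compares


def explicit_transfer(window_by_name, consumers, tag, value):
    """Decentralized style: the producer lists its consumers, no tag search."""
    for name in consumers:
        window_by_name[name].values[tag] = value
    return len(consumers)               # number of point-to-point messages


def make_window():
    return [
        Entry("add", ["t1", "t2"]),
        Entry("mul", ["t1", "t3"]),
        Entry("sub", ["t4", "t5"]),
        Entry("shl", ["t6", "t7"]),
    ]


# Deliver the same result (tag "t1") with both schemes.
w1 = make_window()
compares = broadcast_wakeup(w1, "t1", 42)

w2 = make_window()
by_name = {e.name: e for e in w2}
messages = explicit_transfer(by_name, ["add", "mul"], "t1", 42)

print(compares, messages)
```

In this toy setting the broadcast cost grows with the number of operand slots in the window, whereas the explicit transfer cost grows only with the number of consumers of the result. Both schemes leave the window in the same state.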
Cite this article
Berekovic, M., Berekovic, M. & Niggemeier, T. A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance. J Sign Process Syst Sign Image 50, 201–229 (2008). https://doi.org/10.1007/s11265-007-0138-6
DOI: https://doi.org/10.1007/s11265-007-0138-6