FLUX interconnection networks on demand

https://doi.org/10.1016/j.sysarc.2007.01.006

Abstract

In this paper, we introduce FLUX interconnection networks, a scheme in which the interconnections of a parallel system are established on demand, before or during program execution. We present a programming paradigm that makes the proposed solution feasible. We perform several experiments to show the viability of our approach and the potential performance gain of using the most suitable network configuration for a given parallel program. In several case studies, we evaluate different algorithms developed for meshes or trees and map them onto "grid"-like or reconfigurable physical interconnection networks. Our results clearly show that, depending on the underlying network, different mappings suit different algorithms. Even for a single algorithm, different mappings become more appropriate as the processing data size, the number of utilized nodes, or the hardware cost of the processing elements changes. The implication is that changing interconnection topologies/mappings (dynamically) on demand, depending on program needs, can be beneficial.

Introduction

In computer engineering, improvements have been achieved through technological advances in area, which presumably increases exponentially, and in delay and chip I/O count, which we postulate increase at best linearly. It has been postulated that, under the conjectures stated above, microarchitectures provide a substantial increase in uniprocessor performance. Based on experimental evidence, however, it is doubtful that such a claim can be substantiated for the recent past [2]. Given that uniprocessor microarchitectures may have difficulty exploiting technological advances, it can be envisioned that multiprocessors could be the answer to the performance quest. In the very near future, it is almost certain that VLSI technology will make single-chip multicore general-purpose processors feasible (possibly exceeding the order of 10^x cores, where x ≥ 2). Multiprocessor multichip parallel systems are not new (see, e.g., ILLIAC IV [3]), and it would appear that applying past multiprocessor experience to single-chip VLSI implementations could provide a solution to general-purpose uniprocessor performance scalability. The VLSI design of single-chip massive multiprocessors, however, is only one of the challenges. Simply stated, fitting numerous processors on a single chip does not necessarily imply that performance increases substantially. It is well known that in the past only a small fraction of peak performance has been achieved in parallel systems, and numerous problems prohibit top performance. For example, assuming shared memory paradigms, scalability is not guaranteed a priori: coherence does not scale (not easily) and most definitely creates costs that substantially diminish potential multiprocessor advantages. Additionally, software performance is not "portable".
That is, software developed for a system at time t may not scale to a system developed at time t + 1. One fundamental reason, though by no means the only one, is that software does not "mutate" to take new network topologies into account, while parallel systems seldom keep a single network topology from one design point to the next.

In this paper, we address a single challenge regarding multiprocessor parallel systems: the effects interconnects have on the portability and scalability of software performance. It is well known that algorithms are developed with a particular interconnection network in mind. Traditionally, interconnection networks are rigid, and the interconnection network often (actually usually) changes from one design point to the next. A consequence is that algorithms and software, when ported to a new family of multiprocessor parallel systems, will not scale in terms of performance (at least), and new software development must be undertaken if performance is critical. We introduce a new approach for adaptable networks, diametrically opposite to existing network proposals, stated as follows: interconnection networks are provided (dynamically) on demand to suit the needs of an application/algorithm/program. We describe some potential implementations and propose a programming paradigm that may allow the interconnects to be fused with traditional models. Finally, we provide experimental evidence suggesting that our proposal is promising.

The paper is organized as follows: In Section 2 we discuss previous solutions for interconnects of multiprocessor parallel systems and point out their performance drawbacks. In Section 3, we introduce the FLUX networks, present several implementation schemes, and provide a programming paradigm for changing, dynamically and on demand, the processing and interconnection of processors (general purpose or not), allowing them to adapt to the interconnect demands of software. In Section 4, we provide initial experimental data supporting our approach. Finally, in Section 5 we present our conclusions.


Background

Currently, multiprocessor systems are designed around a specific hardwired interconnect topology. That is, the designer provides the physical structure of the interconnects with a regular network topology in mind, such as a crossbar, cube, or fat-tree. Furthermore, the network structure is fixed and rigid. For example, once the designer fixes the link width, it remains the same for the entire lifetime of the parallel system. Additionally, since the physical structure of the network is …

FLUX interconnects on demand

In FLUX networks, it is the network that adapts rather than the parallel programs. To do so, the underlying physical network is required to provide higher flexibility than current fixed networks. Obviously, this flexibility comes at the expense of delay and possibly area overhead, a fair price to pay, just as in previous experience with general-purpose computers. Concerning the delay overhead, it is a fact that an application with a logical network "A", when ported to a fixed …
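The on-demand idea sketched above can be illustrated with a toy model: a program declares the logical topology it was written for, and a runtime either configures the fabric to realize it directly or falls back to mapping it onto a fixed physical network. All names here (`NetworkManager`, `request_topology`, the topology strings) are hypothetical illustrations, not the paper's actual programming paradigm.

```python
# Hypothetical sketch of on-demand interconnect selection. The class and
# method names are invented for illustration only.

class NetworkManager:
    """Toy model of an on-demand interconnect controller."""

    def __init__(self, supported):
        # Topologies the reconfigurable fabric can realize directly.
        self.supported = set(supported)
        self.active = None

    def request_topology(self, topology, fallback="2d-mesh"):
        # Configure the requested logical topology if the fabric supports
        # it; otherwise fall back to a fixed physical network, onto which
        # the logical topology must then be mapped (physical != logical).
        if topology in self.supported:
            self.active = topology
            return ("direct", topology)
        self.active = fallback
        return ("mapped", fallback)

mgr = NetworkManager(supported={"2d-mesh", "binary-tree"})
print(mgr.request_topology("binary-tree"))  # realized directly by the fabric
print(mgr.request_topology("hypercube"))    # mapped onto the fixed 2D mesh
```

The point of the sketch is the decision itself: the "direct" path models a reconfigurable fabric that pays area/delay overhead for flexibility, while the "mapped" path models the fixed-network case the section contrasts it with.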

Experimental results

In this section, we provide evidence suggesting the viability of our proposal when the underlying network is either fixed or reconfigurable. First, we evaluate several sample parallel problems using logical interconnects that are binary trees (BT) or 2D meshes. The physical interconnections are assumed to form a 2D mesh. That is, for mesh logical topologies the links are physical = logical, while for BT logical topologies usually physical ≠ logical. We use a regular physical structure …
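The physical ≠ logical case above can be made concrete by measuring dilation: the longest physical path a single logical link must traverse after embedding. The snippet below embeds a complete binary tree (heap numbering, root = 1) into a 2D mesh using a naive row-major placement and reports the worst-case stretch. The row-major mapping is an assumption chosen for illustration, not the embedding evaluated in the paper.

```python
# Embed a complete binary tree into a 2D mesh (row-major placement, an
# illustrative mapping, not the paper's) and compute the dilation: the
# maximum number of mesh hops any single tree edge is stretched across.

def row_major_position(node, cols):
    # Nodes are numbered 1..N; place them left-to-right, top-to-bottom.
    return divmod(node - 1, cols)  # (row, col)

def bt_dilation_on_mesh(height, cols):
    n = 2 ** height - 1  # nodes in a complete binary tree of this height
    dilation = 0
    for parent in range(1, n + 1):
        for child in (2 * parent, 2 * parent + 1):
            if child > n:
                continue
            r1, c1 = row_major_position(parent, cols)
            r2, c2 = row_major_position(child, cols)
            hops = abs(r1 - r2) + abs(c1 - c2)  # Manhattan distance on the mesh
            dilation = max(dilation, hops)
    return dilation

# A 15-node tree on a 4-column mesh: some tree edges stretch across
# several mesh links, so physical != logical for the BT case.
print(bt_dilation_on_mesh(4, 4))  # → 5
```

By contrast, a mesh logical topology mapped onto a mesh physical topology has dilation 1 (every logical link is a physical link), which is exactly the physical = logical case the experiments distinguish.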

Conclusions

In this paper, we introduced the FLUX networks and discussed the performance potential for parallel applications suited to different interconnection topologies/mappings. We studied different types of physical interconnections and presented a programming paradigm as a way to accomplish the configuration (mapping) of an interconnection network on demand. In addition, we presented experimental results showing that, when running a parallel algorithm in a multiprocessor system …

References (20)

  • S. Vassiliadis, I. Sourdis, FLUX networks: interconnects on demand, in: International Conference on Embedded Computer...
  • S. Vassiliadis, L.A. Sousa, G.N. Gaydadjiev, The Midlifekicker Microarchitecture Evaluation Metric, in: Proceedings of...
  • S. Ranka et al., Hypercube Algorithms for Image Processing and Pattern Recognition (1990)
  • F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes (1992)
  • E.M. Reingold et al., Combinatorial Algorithms: Theory and Practice (1977)
  • B. Monien, I. Sudborough, Embedding one interconnection network in another, in: G. Tinhofer et al. (Eds.),...
  • S. Vassiliadis, I. Sourdis, Reconfigurable fabric interconnects, in: International Symposium on System-on-Chip (SoC),...
  • S. Vassiliadis et al., The Molen polymorphic processor, IEEE Transactions on Computers (2004)
  • S. Vassiliadis, I. Sourdis, Reconfigurable FLUX Networks, in: IEEE International Conference on Field Programmable...

Stamatis Vassiliadis (M’86-SM’92-F’97) was born in Manolates, Samos, Greece, in 1951. He is currently a Chair Professor in the Electrical Engineering, Mathematics, and Computer Science (EEMCS) department of Delft University of Technology (TU Delft), The Netherlands. He previously served in the Electrical and Computer Engineering faculties of Cornell University, Ithaca, NY and the State University of New York (S.U.N.Y.), Binghamton, NY. For a decade, he worked with IBM, where he was involved in a number of advanced research and development projects. He received numerous awards for his work, including 24 publication awards, 15 invention awards, and an outstanding innovation award for engineering/scientific hardware design. His 72 USA patents rank him as the top all time IBM inventor. Dr. Vassiliadis received an honorable mention Best Paper award at the ACM/IEEE MICRO25 in 1992 and Best Paper awards in the IEEE CAS (1998, 2001), IEEE ICCD (2001), PDCS (2002) and the best poster award in the IEEE NANO (2005). He is an IEEE and ACM fellow and a member of the Dutch Academy of Science.

Ioannis Sourdis was born in Corfu, Greece, in 1979. He received his Diploma degree in 2002 and his Masters degree in 2004 in Electronic and Computer Engineering from the Technical University of Crete, Greece. He is currently working towards a Ph.D. in Computer Engineering at the Delft University of Technology, The Netherlands. His research interests include the architecture and design of computer systems, multiprocessor parallel systems, interconnection networks, reconfigurable hardware, and networking systems.

This work was supported by the European Commission in the context of the Scalable computer ARChitectures (SARC) integrated project #27648 (FP6). This paper is an extended version of [1].
