Fast and noniterative scheduling in input-queued switches: Supporting QoS

doi:10.1016/j.comcom.2008.12.008

Computer Communications

Volume 32, Issue 5, 27 March 2009, Pages 834-846

https://doi.org/10.1016/j.comcom.2008.12.008

Abstract

We report three fast and scalable scheduling algorithms that provide exact bandwidth guarantees, low delay bounds, and reasonable jitter in input-queued switches. The three schedulers find a maximum input/output matching in a single iteration. They sustain 100% throughput under both uniform and bursty traffic. They work many times faster than existing scheduling schemes, and their speed does not degrade with increased switch size. The SRA and SRA+ algorithms have O(1) time complexity and can be implemented in simple hardware. SRA tends to impose different delays on flows of different classes of service because of their different subscribed portions of the total bandwidth. SRA+, a weighted version of SRA, operates either on cells that arrive uniformly or on cells of packets, in which case the cells of a whole packet are switched contiguously. SRA+ improves over SRA in that all flows undergo the same delays regardless of their bandwidth shares. The schedulers operate on queue groups at the crossbar arbiters in a distributed manner.

Introduction

Real-time network services impose stringent quality of service (QoS) requirements on switches. Switches must guarantee bandwidth (or rate), bound delay, and smooth jitter. This paper presents three novel and efficient switch fabrics that support these QoS functions. Most prior work in the literature addresses only bandwidth guarantees. No known QoS work rigorously tackles delay bounds. We show how exact bandwidth guarantees, low delay bounds, and reasonable jitter can be provided in a switch fabric.

Our switch fabrics are based on the input-queued (IQ) paradigm. We consider them as using crossbars with unbuffered crosspoints. IQ switches have large packet buffers at the input ports. Packets arriving at an input port are organized into separate queues according to their destination output and their class of service (CoS). Such a queue is called a virtual output queue (VOQ). Using VOQs removes head-of-line (HOL) blocking that occurs when FIFO queues are used. In providing QoS, our switch fabrics operate on traffic aggregates. A traffic aggregate can be a number of flows arriving at different input ports but going to the same destination; all the flows require the same CoS.
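The queuing structure described above can be sketched in a few lines of Python. This is only an illustration of the VOQ idea, not the paper's implementation; the class name `InputPort`, its methods, and the cell labels are hypothetical.

```python
from collections import deque


class InputPort:
    """One input port holding a VOQ per (output, class-of-service) pair.

    Hypothetical illustration; names are not taken from the paper.
    """

    def __init__(self, num_outputs, num_classes):
        # One FIFO queue for every (output, CoS) combination.
        self.voqs = {
            (out, cos): deque()
            for out in range(num_outputs)
            for cos in range(num_classes)
        }

    def enqueue(self, cell, output, cos):
        """Place an arriving cell in the VOQ for its output and CoS."""
        self.voqs[(output, cos)].append(cell)

    def nonempty(self):
        """Return the (output, cos) keys that have a cell waiting."""
        return [key for key, queue in self.voqs.items() if queue]

    def dequeue(self, output, cos):
        """Remove and return the head cell of one VOQ."""
        return self.voqs[(output, cos)].popleft()


port = InputPort(num_outputs=4, num_classes=2)
port.enqueue("cell-a", output=2, cos=0)
port.enqueue("cell-b", output=3, cos=0)

# Suppose output 2 is busy in this time slot. With a single FIFO, cell-b
# would be stuck behind cell-a (HOL blocking); with VOQs it stays eligible.
eligible = [key for key in port.nonempty() if key[0] != 2]
```

Because each destination and class has its own queue, a busy output only stalls the cells that are actually bound for it.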

The switching problem is classical. When packets arrive at a network node, they must be separated and forwarded onto the correct output links. Because a node usually has multiple inputs and multiple outputs, this is hard to do. The crux of the problem is to match the inputs that have packets with the outputs through which those packets should exit. The problem is typically modeled as finding a conflict-free matching on a bipartite graph formed by the inputs and outputs. Matching is usually done in a centralized manner. To find a maximum matching, the best algorithm for uniform traffic runs in O(N^2.5) time, and the most efficient algorithm for nonuniform traffic runs in O(N^3 log N) time, where N is the number of inputs and also the number of outputs [22]. Typical approximation algorithms are iterative maximal matching algorithms such as PIM [1] and iSLIP [19], which find a maximal matching in O(log N) iterations. These algorithms are complex even though they support only best-effort traffic. Their ability to support QoS is untested, although they have variants intended to enforce preference and fairness among different traffic flows.
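The difference between a maximal and a maximum matching can be seen on a small example. The sketch below is not PIM, iSLIP, or SRA: it uses a plain greedy pass for the maximal matching and a textbook augmenting-path search for the maximum matching (simpler and slower than the O(N^2.5) algorithm cited above). The function names and the request set are assumptions made for illustration.

```python
def greedy_maximal_matching(requests):
    """Pick input/output pairs in sorted order, skipping any conflict.

    The result is maximal (no further pair can be added) but not
    necessarily maximum.
    """
    used_inputs, used_outputs, matching = set(), set(), []
    for i, j in sorted(requests):
        if i not in used_inputs and j not in used_outputs:
            matching.append((i, j))
            used_inputs.add(i)
            used_outputs.add(j)
    return matching


def maximum_matching(requests, n):
    """Maximum bipartite matching by repeated augmenting-path search."""
    adjacency = {i: [] for i in range(n)}
    for i, j in sorted(requests):
        adjacency[i].append(j)
    owner = {}  # output -> input currently matched to it

    def try_assign(i, seen):
        for j in adjacency[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free output, or re-route the input that holds it.
            if j not in owner or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(n):
        try_assign(i, set())
    return sorted((i, j) for j, i in owner.items())


# Input 0 has cells for outputs 0 and 1; input 1 has cells for output 0.
requests = {(0, 0), (0, 1), (1, 0)}
maximal = greedy_maximal_matching(requests)
maximum = maximum_matching(requests, n=2)
```

Here the greedy pass commits to the pair (0, 0) and then cannot serve input 1, while the maximum matching connects both inputs.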

The importance of solving the switching problem cannot be overstated. A solution is vital to the design and engineering of a switch. As with any other computing problem, the solution has to be efficient and scalable for the network to be functional. New services call for faster and smarter switches. The IQ switch with an unbuffered crossbar is a paradigm that has stood the test of time and is used in most high-end switches. However, as traffic rates go from gigabits to terabits per second and QoS requirements become more stringent, existing algorithms for the IQ switch are no longer adequate.

Robust and efficient switch fabrics that support QoS are still lacking. Existing IQ switch schemes in the literature that support some form of QoS have limitations: they are often too complex, too slow, and not scalable. Some schemes are based on the Slepian–Duguid algorithm (e.g. [1], [11], [15]). Some are based on the Gale–Shapley algorithm (e.g. [12], [17]). Neither type of algorithm is effective in reserving bandwidth, and both can cause severe degradation in throughput. They are also too complex to implement in hardware and to scale. Iterative maximal matching schemes (e.g. [19], [29], [2]) tend to incur long delays and are not scalable. These algorithms and their relevance to our work are discussed in greater detail in Section 6.

Our work offers better solutions that could be used in future-generation switches. In previous work, we studied a best-effort switch fabric that implements a fast noniterative fabric scheduling algorithm called single round-robin arbitration (SRA) and its corresponding scalable architectures [7], [6]. SRA finds a maximum matching in a single iteration. Its time complexity is O(1). SRA runs many times faster than existing algorithms for crossbar arbitration while operating at line rate. SRA is simple to implement in hardware as are its supporting architectures.

In this work, we evaluate how well SRA supports QoS. We show that SRA is as efficient for QoS traffic as it is for best-effort traffic. In addition, we follow the SRA framework for arbitration and add a credit-based weighting mechanism to form a new system, called SRA+, for QoS assurance. SRA+ works in two modes: one operates on individual cells, and the other operates on the cells of whole packets. The SRA+ scheme still runs at line rate and provides fast arbitration and exact QoS guarantees. Because it uses SRA for arbitration, it is many times faster than existing QoS schemes and can thus satisfy most stringent delay requirements. Furthermore, the SRA and SRA+ algorithms all run in constant time, which enables simple hardware implementation. They employ simple queuing structures and disciplines and map onto simple architectures.
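The text available here does not spell out the SRA+ credit rules, so the following is only a generic illustration of how credit-based weighting can deliver exact bandwidth shares. It is a smooth weighted round-robin sketch, not the SRA+ algorithm; the function `credit_schedule`, the class names, and the weights are assumptions, and every class is assumed to be permanently backlogged.

```python
def credit_schedule(weights, num_slots):
    """Serve one class per time slot using integer credits.

    Each slot, every class earns credit equal to its weight. The class
    with the most credit is served and pays the sum of all weights.
    Assumes every class always has a cell waiting.
    """
    total = sum(weights.values())
    credit = {name: 0 for name in weights}
    order = []
    for _ in range(num_slots):
        for name, weight in weights.items():
            credit[name] += weight
        # Ties go to the class listed first in `weights`.
        winner = max(credit, key=credit.get)
        credit[winner] -= total
        order.append(winner)
    return order


# Hypothetical classes subscribing to 50%, 30%, and 20% of the bandwidth.
weights = {"gold": 5, "silver": 3, "bronze": 2}
order = credit_schedule(weights, num_slots=10)
served = {name: order.count(name) for name in weights}
```

Over any window whose length equals the sum of the weights, each class is served exactly as many times as its weight, and the services of a class are spread across the window rather than bunched together.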

It is worth noting that it is easy to implement QoS functions in an output-queued (OQ) switch where packets are placed in queues in the output ports after they arrive at the switch. However, an N × N OQ switch must operate N times faster than the line rate. Memory technology simply cannot catch up to that kind of speed. There exist architectures that emulate the working of the OQ switch [25], [9], [28]. OQ emulation of constant speedup automatically provides QoS. The weakness of the existing work is mainly in that it is too complex, not practical, and thus restricted to theoretical value.
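To make the N-times speedup concrete, the arithmetic can be worked through for one configuration. The port count, line rate, and cell size below are assumed values chosen for illustration, not figures from the paper, and the helper names are hypothetical.

```python
def cell_time_ns(line_rate_bps, cell_bytes):
    """Time to receive one cell at line rate, in nanoseconds."""
    return cell_bytes * 8 / line_rate_bps * 1e9


def oq_memory_cycle_ns(num_ports, line_rate_bps, cell_bytes):
    """Time budget per cell when memory runs N times faster than line rate.

    In the worst case all N inputs send a cell to the same output in one
    cell time, so the output memory must absorb N cells in that time.
    """
    return cell_time_ns(line_rate_bps, cell_bytes) / num_ports


# Assumed example: 32 ports, 10 Gb/s links, 64-byte cells.
slot = cell_time_ns(line_rate_bps=10e9, cell_bytes=64)
cycle = oq_memory_cycle_ns(num_ports=32, line_rate_bps=10e9, cell_bytes=64)
```

Under these assumptions a cell time is about 51.2 ns, so the output memory of the OQ switch would have to accept a cell roughly every 1.6 ns, and the budget shrinks further as the port count or the line rate grows.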

The rest of the paper is organized as follows. Section 2 describes the IQ switch model, the queuing structure of the fabric, and two supporting crossbars. Section 3 defines the switching problem and the SRA algorithm as used to support QoS, and discusses SRA’s performance based on our simulation results. Section 4 introduces the SRA+ algorithm that works on cells regardless of the packets these cells belong to, and evaluates its performance by simulation. We specifically show that SRA+ eliminates the dispersion in various statistics observed with SRA. Section 5 describes the SRA+ algorithm that works on cells of whole packets. We show how our simulations can be modeled as a standard loading process and derive the formula for calculating the load rate. We then discuss how the performance of SRA+ in this mode compares with its performance in the pure-cells mode. In Section 6, we discuss existing work on supporting QoS, pointing out the pros and cons of known algorithms and why our SRA/SRA+ schedulers are better. We conclude the paper in Section 7. In the appendix, we describe the SRA algorithm for best-effort traffic and prove some of SRA’s properties. From these proofs, the validity of the same properties for the SRA+ algorithms can be easily derived.

Section snippets

Model

Fig. 1 illustrates a switch model that represents the IQ switch. It shows the overall architecture and the queuing structure of the IQ switch. The model consists of N input ports, an N × N crossbar with no buffer at cross-points, and N output ports. Arbiters sit between the input ports and the crossbar to arbitrate access to the crossbar.

An input port of an IQ switch will have buffering of around 150 ms which is approximately the round-trip time of traffic along a typical Internet path. An output

SRA for QoS

SRA was initially devised to switch best-effort traffic quickly, but it can also guarantee bandwidth. Its delay performance does not degrade when it is used to guarantee bandwidth, and it guarantees bandwidth exactly. We can find a delay bound that is the same for all CoSs. Service contracts can be practically enacted by specifying the bandwidth required.

SRA+ for cells

SRA+ is a weighted version of SRA. As discussed in the last section, under SRA the statistics of CoSs that take different bandwidth proportions disperse at high loads; SRA+ minimizes this dispersion. SRA+ maintains the low delay and low jitter properties of SRA when guaranteeing bandwidth. Like SRA, it sustains 100% throughput. The SRA+ algorithm runs in O(1) time and is amenable to simple hardware implementation.

Switching packets: SRA+ under bursty traffic

In the last two sections, we evaluated SRA and SRA+ when packets are segmented into fixed-size cells and the switch fabrics operate on these cells with no regard to the packets they belong to. As a result, cells belonging to the same packet may be switched noncontiguously, which increases the delay before a packet can be reassembled and sent out of the output port as a whole. In this section, we make SRA+ switch all the cells belonging to a packet after

Related work

To put our contribution in the right perspective, we review the major relevant work. Existing schemes capable of QoS are often too slow, too complex, and not scalable. These schemes mostly support only rate guarantees and do not address delay guarantees.

One class of algorithms is based on the Slepian–Duguid algorithm (SDA), as in [1], [11], [15], [12], [17]. SDA is a frame-based algorithm. It was originally developed for Clos networks (Chapter 3 of [10]). It precomputes a

Concluding remarks

The three noniterative schedulers (or algorithms), SRA, SRA+ for cells, and SRA+ for packets, are many times faster than known conventional QoS schedulers. They have good QoS performance. They are capable of exact bandwidth guarantees and tight delay bounds. They sustain 100% throughput and allow the total available bandwidth to be fully subscribed. They are also scalable: their performance does not degrade with switch size. They all run in constant time.

The three algorithms can be implemented in

Acknowledgements

This work is supported in part by NSF CCR-0309461, NSF IIS-0513669, HK CERG 526007 (HK PolyU B-Q06B), NSFC 60728206, and NSF 0714057.

The code used for the simulations is based in part on the code authored by Prof. Ken Christensen of the University of South Florida. The initial code is for best-effort traffic. PIM and iSLIP results obtained with the initial code were validated against results shown in [21].

The authors thank the anonymous reviewers for reviewing the paper and recommending it for

References (32)

N. McKeown et al., A quantitative comparison of iterative scheduling algorithms for input-queued switches, Computer Networks and ISDN Systems (1998)
B. Prabhakar et al., On the speedup required for combined input- and output-queued switching, Automatica (1999)
T.E. Anderson et al., High-speed switch scheduling for local-area networks, ACM Transactions on Computer Systems (1993)
H. Balakrishnan, S. Devadas, D. Ehlert, Arvind, Rate guarantees and overload protection in input-queued switches, in:...
G. Bracha, Removing egress memory from switching architectures, CommsDesign.com, Feb. 2003. URL:...
C.-S. Chang, W.-J. Chen, H.-Y. Huang, On service guarantees for input-buffered crossbar switches: a capacity...
C.-S. Chang, W.-J. Chen, H.-Y. Huang, Birkhoff-von Neumann input buffered crossbar switches, in: Proceedings of IEEE...
K.F. Chen, E.H.-M. Sha, S.Q. Zheng, Fast and noniterative scheduling in input-queued switches, submitted for...
K.F. Chen, E.H.-M. Sha, S.Q. Zheng, A fast noniterative scheduler for input-queued switches with unbuffered crossbars,...
S.-T. Chuang, A. Goel, N. McKeown, B. Prabhakar, Matching output queueing with a combined input output queued switch,...
S.-T. Chuang et al., Matching output queueing with a combined input/output-queued switch, IEEE Journal on Selected Areas in Communications (1999)
J.Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks (1990)
A. Hung, G. Kesidis, N. McKeown, ATM input-buffered switches with the guaranteed-rate property, in: Proceedings of IEEE...
A.C. Kam et al., Linear-complexity algorithms for QoS support in input-queued switches with no speedup, IEEE Journal on Selected Areas in Communications (1999)
H. Kim et al., Performance analysis of the multiple input-queued packet switch with the restricted rule, IEEE/ACM Transactions on Networking (2003)
H. Kim et al., Throughput analysis of the bifurcated input-queued ATM switch, IEICE Transactions on Communications (1999)

