Elsevier

Advances in Computers

Volume 96, 2015, Pages 139-186
Chapter Four - Sorting Networks on Maxeler Dataflow Supercomputing Systems

https://doi.org/10.1016/bs.adcom.2014.10.001

Abstract

The primary contribution of this study is the implementation and evaluation of network sorting algorithms on a Maxeler dataflow computer. Sorting is extensively used in numerous applications. We discuss sequential, parallel, and network sorting algorithms. The major part of this study is dedicated to the properties, construction, and testing of sorting networks. We introduce and compare principal network sorting algorithms with predominant sequential and parallel sorting algorithms. We implement network sorting algorithms in an entry model of the Maxeler dataflow supercomputing system. The goal of our study is to compare the sorting times of network sorting algorithms using a Maxeler dataflow computer with the sorting times of optimal sequential and parallel sorting algorithms using a control flow computer. In different testing scenarios, we demonstrate that high sorting speedups can be achieved with network sorting using a Maxeler dataflow computer. We sorted arrays of 128 values. Using different testing parameters, we achieved speedups that ranged from approximately 10 to more than 200. Sorting networks that execute parallel sorting using the dataflow computational paradigm offer a possible solution for expanding volumes of data. By converting to more advanced Maxeler systems and researching new ideas and solutions, we aim to sort large arrays and achieve large speedups.

Introduction

Sorting is a critical operation in the realm of computer systems. According to Ref. [1], sorting accounts for approximately one-quarter of the total running time on computer systems and approximately one-half of the running time in some cases. These estimated figures suggest that (a) sorting is employed in numerous applications, (b) sorting is not always necessary, or (c) inefficient sorting algorithms are prevalent. Therefore, a constant quest for improved sorting algorithms and their proper use and practical implementation is necessary.

In numerous application domains, sorting is an indispensable part of applications with or without the knowledge of users. A common example is searching for information on the Internet, in which search algorithms work with sorted data; a natural method to present search results is an ordered list of items that matches the search criteria [2].

The majority of applications and computer systems employ well-studied comparison-based sequential sorting algorithms [3]. Parallel processing and the use of parallel sorting algorithms enable speedups. Parallelization is commonly achieved with the use of multi-core or many-core systems, which can produce speedups that are approximately proportional to the number of cores. A new paradigm termed dataflow computing has re-emerged and is being successfully employed in numerous computationally intensive application domains. Dataflow computing offers immense parallelism by utilizing thousands of tiny simple computational elements to improve the performance by orders of magnitude.

The exponential increase in data volumes demands faster communication networks and high-speed computing [4]; thus, it is a driving force for the search for efficient, faster, and parallelized computer algorithms [5]. Due to rapidly increasing data volumes, sorting algorithms face similar challenges. Although the field of sorting algorithms has been well studied during the entire era of computer science, new technologies, new paradigms, and new applications are constantly reshaping the demands and conditions for sorting algorithms. Algorithms that previously were impractical or inadequate have re-emerged with the development of new technologies.

This chapter is organized in the following manner. In Section 2, we present the motivation for this study, which defines the framework for our ideas, experiments, and results. In Section 3, we provide a short tutorial on sorting algorithms and include several tables and graphs for their classification and comparison, with a special emphasis on network sorting algorithms. Section 4 addresses sorting networks, which is the theoretical core of this study. We present the basic properties, construction, algorithms, operating examples, and comparisons of sorting networks. In Section 5, we explain the implementation of the sorting networks on the dataflow computer. In Section 6, we define our experimental setup. Section 7, which is extensive, is dedicated to the presentation of our experimental results. Finally, we present the conclusions of this study in Section 8.

Section snippets

Motivation

The primary motivation of this study is to investigate the potential use of the dataflow computing paradigm for sorting algorithms and their implementation on a dataflow computer. This study requires several tasks to be performed in different fields of computing.

We had the opportunity to use the Maxeler MAX2 card, which is an entry model of the Maxeler dataflow supercomputing systems. The MAX2 card is considered to be a field-programmable gate array (FPGA) hardware accelerator. Although the

Sorting Algorithms

Sorting algorithms have been extensively investigated during the entire era of computer science. Numerous different approaches have been utilized and an immense number of sorting algorithms have been developed and analyzed.

Sorting algorithms can be classified based on different criteria, such as computational complexity (best, average, and worst), memory usage, stability, general sorting method (insertion, selection, merging, exchange, and partitioning), and whether they involve comparison

Sorting Networks

Sorting networks, which are a subset of comparison networks, are abstract machines that are solely constructed of wires and comparators. In comparison networks, the wires that interconnect comparators must form a directed acyclic graph [3].
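The wires-and-comparators model can be sketched in a few lines of Python. This is an illustrative sketch, not the chapter's implementation: the names apply_network, is_sorting_network, and NETWORK_4 are assumptions introduced here. The check relies on the zero-one principle, a standard result stating that a comparison network sorts every input if it sorts every sequence of zeros and ones.

```python
from itertools import product


def apply_network(comparators, values):
    """Run values through a comparison network.

    Each comparator (i, j) with i < j places the smaller value on wire i
    and the larger value on wire j.
    """
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def is_sorting_network(comparators, n):
    """Check a network on n wires against all 2**n zero-one inputs."""
    return all(
        apply_network(comparators, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


# Hypothetical example: a 4-wire network built from 5 comparators.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

if __name__ == "__main__":
    print(apply_network(NETWORK_4, [3, 1, 4, 2]))
    print(is_sorting_network(NETWORK_4, 4))
```

Dropping the last comparator from NETWORK_4 leaves a valid comparison network that no longer sorts, which is the distinction between the two classes.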

A sorting network is a comparison network with wires and comparators that are connected in a manner to perform the function of sorting their input data. The input to a sorting network is an array of items in an arbitrary order, whereas the output of a sorting

Implementation

We have implemented different sorting network algorithms on the MAX2 card, which is the low-end version of the Maxeler dataflow supercomputing system and is an FPGA hardware accelerator. Consequently, the execution of algorithms that run on Maxeler systems must be highly independent of input data or intermediate computational results and dependent only on the number of input items [2].
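The requirement that execution depend only on the number of inputs can be illustrated with a comparator schedule generated from the array length alone. The sketch below uses Batcher's odd-even merge sort construction as one example of such a network. It is written in Python for a control-flow host rather than for the MAX2 card, it handles power-of-two lengths only, and all function names are assumptions made for illustration.

```python
def oddeven_merge(lo, hi, r):
    """Yield comparators that merge the two sorted halves of wires lo..hi,
    visiting every r-th wire."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Yield comparators that sort wires lo..hi (inclusive bounds)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def build_network(n):
    """Return the comparator list for n wires; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return list(oddeven_merge_sort(0, n - 1))


def sort_with_network(network, values):
    """Apply a fixed comparator schedule; the schedule never inspects the data."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    network = build_network(8)
    print(len(network), "comparators")
    print(sort_with_network(network, [5, 2, 7, 1, 8, 3, 6, 4]))
```

The comparator list is computed once from n and reused for every array of that length. Only the outcome of each compare-exchange depends on the values, which is what makes such a schedule suitable for a fixed hardware layout.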

Maxeler systems are dataflow computers that are capable of employing a large number of tiny interconnected

Setting Up the Experiment

Applications typically use sorting as a part of a more extensive process. Within these processes, sorting can be used occasionally, regularly, or continuously. To cover different frequencies of sorting usage in applications, we have devised two scenarios.

In the first scenario, we consecutively sort several arrays of numbers. Each array is independently sent to the Maxeler card. Consequently, the card must be separately started for each array; this policy increases the total sorting time.

In the

Experimental Results

We present the experimental results in three categories. First, we discuss some limitations and constraints of the implementation of network sorting algorithms on the MAX2 card. Second, we compare the sorting times of different network sorting algorithms. Finally, we compare the best network sorting algorithm executed on a MAX2 card with the best sequential (and parallel) algorithms executed on a host CPU; the final results comprise the speedup results achieved through the utilization of the

Conclusion

Our study has demonstrated that nearly forgotten algorithms and solutions, which were previously impossible or impractical to implement, can re-emerge with advances in technology that enable their use. Sorting networks and their implementation on FPGAs are a natural match, and the parallel execution of comparisons yields high sorting speedups. The extensive results in this chapter demonstrate these speedups.

The limited number of comparators that can be realized on the FPGA chip

References (20)

  • Donald E. Knuth, The art of computer programming (2002)
  • Robert Sedgewick, Algorithms in Java (2010)
  • Thomas H. Cormen et al., Introduction to Algorithms (2009)
  • M.J. Flynn et al., Moving from Petaflops to Petadata, Communications of the ACM (2013)
  • Yale Patt, Future Microprocessors: What Must We do Differently if We Are to Effectively Utilize Multi-Core and Many-Core Chips? (2009)
  • Sorting Algorithm, http://en.wikipedia.org/wiki/Sorting_algorithm (accessed...
  • Parallel Computing, http://en.wikipedia.org/wiki/Parallel_computing (accessed...
  • Veljko Milutinovic, Surviving the Design of Microprocessor and Multimicroprocessor Systems: Lessons Learned (2000)
  • K.E. Batcher, Sorting networks and their applications
  • Sorting Network, http://en.wikipedia.org/wiki/Sorting_network (accessed...
There are more references available in the full text version of this article.

ABOUT THE AUTHORS

Anton Kos received his Ph.D. in electrical engineering from the University of Ljubljana, Slovenia, in 2006. He is a senior lecturer at the Faculty of Electrical Engineering, University of Ljubljana. He is a member of the Laboratory of Communication Devices at the Department of Telecommunications. His teaching and research work includes communication networks and protocols, quality of service, dataflow computing and applications, usage of inertial sensors in biofeedback applications, and information systems. He is the (co)author of eight papers that appeared in international engineering journals and of more than thirty papers at international conferences.

Vukašin Ranković received his B.Sc. in electrical engineering and information theory from the University of Belgrade, Serbia, in 2014. He is a master's student at the School of Electrical Engineering, University of Belgrade. He also works at the Microsoft Development Center Serbia as a software engineer on the Bing Local Classification team. His research and work include machine learning and dataflow computing. He is the (co)author of one paper that appeared in an international engineering journal and of one paper at an international conference.

Sašo Tomažič received his Ph.D. in electrical engineering from the University of Ljubljana, Slovenia, in 1991. He is a full professor at the Faculty of Electrical Engineering, University of Ljubljana. He is the head of the Laboratory of Communication Devices and the head of the Department of Telecommunications. He was the advisor for networking, teleworking, and telematics at the Ministry of Education of Slovenia; the advisor for telecommunications at the Ministry of Defense; and the National Coordinator for telecommunications research of the Republic of Slovenia. He is the (co)author of more than fifty papers that appeared in international engineering journals and of more than a hundred papers at international conferences.
