Chapter Four - Sorting Networks on Maxeler Dataflow Supercomputing Systems
Introduction
Sorting is a critical operation in the realm of computer systems. According to Ref. [1], sorting accounts for approximately one-quarter of the total running time on computer systems and approximately one-half of the running time in some cases. These estimated figures suggest that (a) sorting is employed in numerous applications, (b) sorting is not always necessary, or (c) inefficient sorting algorithms are prevalent. Therefore, a constant quest for improved sorting algorithms and their proper use and practical implementation is necessary.
In numerous application domains, sorting is an indispensable part of applications with or without the knowledge of users. A common example is searching for information on the Internet, in which search algorithms work with sorted data; a natural method to present search results is an ordered list of items that matches the search criteria [2].
The majority of applications and computer systems employ well-studied comparison-based sequential sorting algorithms [3]. Parallel processing and the use of parallel sorting algorithms enable speedups. Parallelization is commonly achieved with the use of multi-core or many-core systems, which can produce speedups that are approximately proportional to the number of cores. A new paradigm termed dataflow computing has re-emerged and is being successfully employed in numerous computationally intensive application domains. Dataflow computing offers immense parallelism by utilizing thousands of tiny simple computational elements to improve the performance by orders of magnitude.
The exponential increase in data volumes demands faster communication networks and high-speed computing [4]; thus, it is a driving force for the search for efficient, faster, and parallelized computer algorithms [5]. Due to rapidly increasing data volumes, sorting algorithms face similar challenges. Although the field of sorting algorithms has been well studied during the entire era of computer science, new technologies, new paradigms, and new applications are constantly reshaping the demands and conditions for sorting algorithms. Algorithms that previously were impractical or inadequate have re-emerged with the development of new technologies.
This chapter is organized as follows. In Section 2, we present the motivation for this study, which defines the framework for our ideas, experiments, and results. In Section 3, we provide a short tutorial on sorting algorithms and include several tables and graphs for their classification and comparison, with a special emphasis on network sorting algorithms. Section 4 addresses sorting networks, which form the theoretical core of this study; there we present their basic properties, construction, algorithms, operating examples, and comparisons. In Section 5, we explain the implementation of the sorting networks on the dataflow computer. In Section 6, we define our experimental setup. Section 7 presents our experimental results in detail. Finally, we present the conclusions of this study in Section 8.
Section snippets
Motivation
The primary motivation of this study is to investigate the potential use of the dataflow computing paradigm for sorting algorithms and their implementation on a dataflow computer. This study requires several tasks to be performed in different fields of computing.
We had the opportunity to use the Maxeler MAX2 card, which is an entry model of the Maxeler dataflow supercomputing systems. The MAX2 card is considered to be a field-programmable gate array (FPGA) hardware accelerator. Although the
Sorting Algorithms
Sorting algorithms have been extensively investigated during the entire era of computer science. Numerous different approaches have been utilized and an immense number of sorting algorithms have been developed and analyzed.
Sorting algorithms can be classified based on different criteria, such as computational complexity (best, average, and worst), memory usage, stability, general sorting method (insertion, selection, merging, exchange, and partitioning), and whether they involve comparison
Sorting Networks
Sorting networks, which are a subset of comparison networks, are abstract machines that are solely constructed of wires and comparators. In comparison networks, the wires that interconnect comparators must form a directed acyclic graph [3].
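To make the wires-and-comparators picture concrete, the following minimal Python sketch models a comparison network as layers of comparators acting on a list of wire values. The names `NETWORK_4`, `apply_network`, and `sorts_all_binary_inputs` are illustrative and not taken from the chapter. The 4-input layout is the well-known five-comparator network, and the check relies on the zero-one principle: a comparison network that sorts every input of zeros and ones sorts every input.

```python
from itertools import product

# Each comparator is a pair of wire indices (i, j) with i < j. After it fires,
# wire i carries the smaller value and wire j the larger one.
# Comparators within one layer touch disjoint wires, so hardware could
# evaluate a whole layer in parallel; this sketch runs them one by one.
NETWORK_4 = [
    [(0, 1), (2, 3)],
    [(0, 2), (1, 3)],
    [(1, 2)],
]


def apply_network(layers, values):
    """Push the values through the comparator layers and return the outputs."""
    wires = list(values)
    for layer in layers:
        for i, j in layer:
            if wires[i] > wires[j]:
                wires[i], wires[j] = wires[j], wires[i]
    return wires


def sorts_all_binary_inputs(layers, n):
    """Zero-one principle check over all 2**n inputs made of 0s and 1s."""
    return all(
        apply_network(layers, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )
```

For instance, `apply_network(NETWORK_4, [3, 1, 4, 2])` returns `[1, 2, 3, 4]`, and `sorts_all_binary_inputs(NETWORK_4, 4)` is `True`. Dropping the last layer leaves a valid comparison network that no longer sorts, which illustrates why sorting networks are only a subset of comparison networks.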
A sorting network is a comparison network with wires and comparators that are connected in a manner to perform the function of sorting their input data. The input to a sorting network is an array of items in an arbitrary order, whereas the output of a sorting
Implementation
We have implemented different sorting network algorithms on the MAX2 card, which is the low-end version of the Maxeler dataflow supercomputing system and is an FPGA hardware accelerator. Consequently, the execution of algorithms that run on Maxeler systems must be highly independent of input data or intermediate computational results and only dependent on the number of input data [2].
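The requirement that execution depend only on the number of inputs can be illustrated by generating a network's comparator schedule from n alone, before any data are seen. The sketch below does this for Batcher's odd-even merge sort, chosen here only as an example network. It assumes that n is a power of two, the function names are hypothetical, and it is ordinary host-side Python rather than Maxeler code.

```python
def _merge(lo, hi, r):
    """Comparators that merge the wires lo..hi (inclusive) taken at stride r."""
    step = 2 * r
    if step < hi - lo:
        yield from _merge(lo, hi, step)        # even-indexed subsequence
        yield from _merge(lo + r, hi, step)    # odd-indexed subsequence
        for i in range(lo + r, hi - r, step):  # final clean-up comparators
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def _sort(lo, hi):
    """Comparators that sort the wires lo..hi (inclusive)."""
    if hi > lo:
        mid = lo + (hi - lo) // 2
        yield from _sort(lo, mid)
        yield from _sort(mid + 1, hi)
        yield from _merge(lo, hi, 1)


def comparator_schedule(n):
    """Return the fixed list of (i, j) comparators for n wires.

    The result is a function of n only; no input value is ever inspected.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    return list(_sort(0, n - 1))


def run_schedule(schedule, values):
    """Apply a precomputed schedule to concrete data."""
    wires = list(values)
    for i, j in schedule:
        if wires[i] > wires[j]:
            wires[i], wires[j] = wires[j], wires[i]
    return wires
```

With this sketch, `comparator_schedule(4)` yields five comparators and `comparator_schedule(8)` yields nineteen. The same schedule is reused for every array of that length, which is the property that lets a fixed network be laid out in hardware ahead of time.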
Maxeler systems are dataflow computers that are capable of employing a large number of tiny interconnected
Setting Up the Experiment
Applications typically use sorting as a part of a more extensive process. Within these processes, sorting can be used occasionally, regularly, or continuously. To cover different frequencies of sorting usage in applications, we have devised two scenarios.
In the first scenario, we consecutively sort several arrays of numbers. Each array is independently sent to the Maxeler card. Consequently, the card must be separately started for each array; this policy increases the total sorting time.
In the
Experimental Results
We present the experimental results in three categories. First, we discuss some limitations and constraints of the implementation of network sorting algorithms on the MAX2 card. Second, we compare the sorting times of different network sorting algorithms. Finally, we compare the best network sorting algorithm executed on a MAX2 card with the best sequential (and parallel) algorithms executed on a host CPU; the final results comprise the speedup results achieved through the utilization of the
Conclusion
Our study has demonstrated that nearly forgotten algorithms and solutions, which were previously impossible or impractical to implement, can re-emerge with advances in technology that enable their use. Sorting networks and their implementation on FPGAs are a natural match, and the parallel execution of comparisons yields high sorting speedups. These speedups are demonstrated by the extensive results in this chapter.
The limited number of comparators that can be realized on the FPGA chip
References (20)
- The art of computer programming (2002)
- Algorithms in Java (2010)
- et al., Introduction to Algorithms (2009)
- et al., Moving from Petaflops to Petadata, Communications of the ACM (2013)
- Future Microprocessors: What Must We do Differently if We Are to Effectively Utilize Multi-Core and Many-Core Chips? (2009)
- Sorting Algorithm, http://en.wikipedia.org/wiki/Sorting_algorithm (accessed...
- Parallel Computing, http://en.wikipedia.org/wiki/Parallel_computing (accessed...
- Surviving the Design of Microprocessor and Multimicroprocessor Systems: Lessons Learned (2000)
- Sorting networks and their applications
- Sorting Network, http://en.wikipedia.org/wiki/Sorting_network (accessed...
Cited by (19)
A runtime job scheduling algorithm for cluster architectures with dataflow accelerators
2022, Advances in Computers
Citation Excerpt: It was designed by Ken Batcher for sorting networks of size O(n*(log n)^2), but it gained popularity with GPU processing. Kos et al. implemented the Odd-Even Merge Network Sort algorithm using the Maxeler DataFlow hardware, and achieved acceleration factors of 100–150, when sorting more than 1000 arrays consecutively [37]. Each of the arrays consisted of 16–128 elements.
Fast sampling from Wiener posteriors for image data with dataflow engines
2018, Astronomy and Computing
Adaptation and Evaluation of the Simplex Algorithm for a Data-Flow Architecture
2017, Advances in Computers
Citation Excerpt: To be able to fully exploit the advantages of the data-flow architecture, the algorithms have to be carefully reengineered, since most algorithms were tailored specifically for the control-flow architecture and are thus often not directly implementable as a data-flow. Many algorithms have already been implemented for this architecture, from sorting algorithms [6,7], fast Fourier transform [8] to various physics simulations, e.g., weather models [9]. It has been most successful in classical domains of high-performance computing, i.e., various numerical applications.
Exploring Future Many-Core Architectures: The TERAFLUX Evaluation Framework
2017, Advances in Computers
Citation Excerpt: COTSon permits to integrate the architectural elements to enable a dataflow-based execution model on the top of control-flow elements, as explained in Section 1. Further recent research on dataflow-based system can be found in the literature [70–75]. In summary, exploring a future many-core architecture like TERAFLUX requires the flexibility, simulation speed, and scalability of COTSon.
Data Flow Computing in Geoscience Applications
2017, Advances in Computers
Citation Excerpt: The Maxeler DFE is a high-performance processing system that combines reconfigurable FPGAs to provide high efficiency and easy-to-use programming interfaces. There are some successful implementations and accelerations of serial applications using Maxeler systems [10–13]. Thus, in the rest of this section, we give a brief introduction to the Maxeler DFE systems.
Dataflow-Based Parallelization of Control-Flow Algorithms
2017, Advances in Computers
Citation Excerpt: For example, if each element, except the first one in a row of a table, depends on the previous element in the same row, but rows are mutually independent, instead of calculating the elements row by row, elements could be calculated column by column. Kos et al. achieved the speedup of 100–150 for sorting more than 1000 arrays consecutively using the dataflow implementation of the Odd–even merge network sort algorithm [69,70]. Arrays consisted of 16–128 elements; speedup is calculated as a ratio between the execution time of sorting on the CPU and the execution time of sorting on DFE.
ABOUT THE AUTHORS
Anton Kos received his Ph.D. in electrical engineering from the University of Ljubljana, Slovenia, in 2006. He is a senior lecturer at the Faculty of Electrical Engineering, University of Ljubljana. He is a member of the Laboratory of Communication Devices at the Department of Telecommunications. His teaching and research work includes communication networks and protocols, quality of service, dataflow computing and applications, the use of inertial sensors in biofeedback applications, and information systems. He is the (co)author of eight papers that appeared in international engineering journals and of more than thirty papers presented at international conferences.
Vukašin Ranković received his B.Sc. in electrical engineering and information theory from the University of Belgrade, Serbia, in 2014. He is a master's student at the School of Electrical Engineering, University of Belgrade. He also works at the Microsoft Development Center Serbia as a software engineer on the Bing Local Classification team. His research and work include machine learning and dataflow computing. He is the (co)author of one paper that appeared in an international engineering journal and of one paper presented at an international conference.
Sašo Tomažič received his Ph.D. in electrical engineering from the University of Ljubljana, Slovenia, in 1991. He is a full professor at the Faculty of Electrical Engineering, University of Ljubljana. He is the head of the Laboratory of Communication Devices and the head of the Department of Telecommunications. He was the advisor for networking, teleworking, and telematics at the Ministry of Education of Slovenia; the advisor for telecommunications at the Ministry of Defense; and the National Coordinator for telecommunications research of the Republic of Slovenia. He is the (co)author of more than fifty papers that appeared in international engineering journals and of more than a hundred papers presented at international conferences.