High-throughput Ant Colony Optimization on graphics processing units

https://doi.org/10.1016/j.jpdc.2017.12.002

Highlights

  • We propose an agnostic vectorization approach for Ant Colony Optimization on GPUs.

  • Different communication and synchronization schemes at warp level are studied.

  • A new selection procedure, called SS-Roulette (Scan–Stencil Roulette), is introduced.

  • An atomic-based approach to the pheromone update is also analyzed on different GPUs.

Abstract

Nowadays, computer researchers can face ever more complex scientific problems by using hardware and software co-design. One successful approach is to explore novel massively-parallel nature-inspired algorithms, such as the Ant Colony Optimization (ACO) algorithm, through the exploitation of high-throughput accelerators such as GPUs, which are designed to provide high levels of parallelism and a low energy cost per instruction through heavy vectorization. In this paper, we demonstrate how to take advantage of contemporary hardware-based CUDA vectorization to optimize the ACO algorithm when applied to the Traveling Salesman Problem (TSP). Several parallel designs are proposed and analyzed on two different CUDA architectures. Our results reveal that our vectorization approaches obtain good performance on these architectures. Moreover, atomic operations are studied, showing clear benefits on the latest generations of CUDA architectures. This work lays the groundwork for future developments of the ACO algorithm on high-performance platforms.

Introduction

Two well-established principles that have traditionally guided the development of CMOS-based computing devices are coming to an end. First, Moore’s law, which enables computer architects to exploit more transistors per unit area, will shortly be limited by physical constraints (sub-nanometer technology nodes seem to be unfeasible [49]). Moreover, Dennard scaling [12], which states that power density remains roughly constant as transistor density increases, has recently broken down [20]. As a result, we are witnessing the green computing era, where traditional homogeneous computing platforms are shifting to heterogeneous systems that are capable of sustaining the desired increase in performance while yielding high energy efficiency.

Heterogeneous architectures are equipped with specialized cores (CPU co-processors and GPUs) in order to accelerate data-parallel algorithms (e.g., 3D graphics rendering, hashing, encryption) and obtain better overall system performance and energy efficiency [23]. The reason is that the design of these specialized cores integrates heavy vectorization, or Single-Instruction Multiple-Data (SIMD) capabilities, which maximize performance per watt, particularly in data-intensive applications [8]. For instance, vectorization is available in most microprocessors introduced across different market segments [5], such as embedded devices (ARM NEON), server and desktop processors (AVX, SSE and SVE) [1], and also accelerators like NVIDIA GPUs [38] or the Intel Xeon Phi [24]. Indeed, accelerators are nowadays the most popular option for accelerating massively parallel and data-intensive workloads.

This new landscape of computation forces programmers to redesign, and even rethink, their algorithms to satisfy energy and performance requirements, as run-time systems are still too immature to meet both requirements at the same time. Some of the most representative examples are population-based algorithms, such as PSO (Particle Swarm Optimization) [19], genetic algorithms [21], and ACO (Ant Colony Optimization) [[15], [16], [18]], as they are massively parallel by their mathematical definition, but their straightforward implementation on GPU architectures may not constitute the best possible solution on current hardware.

Of particular interest to us is enhancing the ACO algorithm. ACO mimics the observed behavior of ant colonies: artificial ants traverse a graph, and a complete tour represents a candidate solution. Tours are evaluated according to the quality of the solution they represent, and the artificial ants then deposit “pheromone” accordingly (the better the solution, the more pheromone they drop). Solving a problem by means of the ACO algorithm involves two main phases: tour construction, where ants run in parallel looking for solutions; and pheromone deposition, where ants communicate with each other.
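
For concreteness, the classical Ant System transition rule (the formulation of Dorigo et al. [16], on which the paper builds) gives the probability that ant k located at city i moves next to an unvisited city j. The snippets of the paper do not restate it, so the notation below follows the usual convention in the ACO literature:

```latex
p_{ij}^{k} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}
                  {\sum_{l \in \mathcal{N}_{i}^{k}} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}},
\qquad j \in \mathcal{N}_{i}^{k},
```

where \tau_{ij} is the pheromone on edge (i, j), \eta_{ij} = 1/d_{ij} is the heuristic desirability of that edge, \alpha and \beta weight their relative influence, and \mathcal{N}_{i}^{k} is the set of cities ant k has not yet visited.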

Parallel versions of ACO have been developed before [[11], [35], [43], [46], [54]]. In particular, in one of our previous works [6], we presented a GPU-ACO version that parallelizes the main phases of the ACO algorithm (i.e., pheromone deposition and tour construction), placing the emphasis on data parallelism. However, to the best of our knowledge, none of the existing parallel implementations for these architectures takes full advantage of the underlying hardware resources, since they either rely on task-based parallelism or become computationally demanding in order to avoid serialization [[6], [7]].

In this paper, we rethink contemporary parallelization strategies for the two main phases of the ACO algorithm (i.e., pheromone deposition and tour construction), in order to optimize performance when running on NVIDIA GPU platforms such as Fermi, Kepler and Maxwell.

The main contributions of the paper include the following:

  • 1.

    We propose an agnostic vectorization scheme, specifically geared towards massively parallel architectures, for ACO’s main stage, tour construction. This proposal maps each ant to both a 32-wide vector (identifying one ant with a CUDA warp) and a 64-wide vector. To implement the latter, we use partial synchronization and different communication schemes based on shuffle instructions combined with shared memory.

  • 2.

    We introduce a novel parallel implementation that mimics the behavior of the classical roulette-wheel selection procedure. This new implementation, called SS-Roulette after the parallel patterns used in its implementation (i.e., Scan and Stencil), improves GPU data parallelism; a hedged sketch of such a warp-level selection is given after this list.

  • 3.

    We provide a complete review of the main phases of the ACO algorithm, tested against different instances of the TSP (Traveling Salesman Problem). We tune different GPU parameters to obtain up to a 3× speedup for the tour construction phase, comparing our contribution against the best published GPU implementation. We analyze the use of atomic instructions in the pheromone update stage, concluding that on newer generations of NVIDIA GPUs the use of atomic instructions over global memory increases performance for scattered memory accesses. Moreover, we propose a joint execution of the two main ACO phases (i.e., pheromone deposition and tour construction) in a single kernel.
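
As mentioned in contribution 2, the following is a minimal sketch of what a Scan-and-Stencil roulette selection can look like at warp level. It is our illustration under stated assumptions, not the authors’ code: the function name, the one-city-per-lane layout, and the restriction to at most 32 candidate cities per step are ours, and we use the modern `__shfl_*_sync` intrinsics (on the Kepler and Maxwell GPUs targeted by the paper, the older non-`_sync` shuffle instructions would be used instead).

```cuda
// Hypothetical sketch of a warp-level roulette selection in the spirit of
// SS-Roulette (Scan + Stencil). Assumes one ant per 32-thread warp and,
// for brevity, at most 32 candidate cities per construction step.
__device__ int warpRouletteSelect(float weight,   // tau^alpha * eta^beta for this lane's city
                                  bool  visited,  // stencil: has this city been visited?
                                  float rnd01)    // same uniform random in [0,1) on all lanes
{
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // Stencil: mask out already-visited cities.
    float w = visited ? 0.0f : weight;

    // Scan: inclusive prefix sum across the warp via shuffles.
    float prefix = w;
    for (int offset = 1; offset < 32; offset <<= 1) {
        float up = __shfl_up_sync(FULL_MASK, prefix, offset);
        if (lane >= offset) prefix += up;
    }
    float total = __shfl_sync(FULL_MASK, prefix, 31);  // grand total, broadcast to all lanes

    // Spin the roulette: the first lane whose running sum crosses the target wins.
    float target = rnd01 * total;
    unsigned ballot = __ballot_sync(FULL_MASK, prefix > target);
    return __ffs(ballot) - 1;   // lane id of the selected city; -1 if no candidate
}
```

The stencil zeroes out already-visited cities, the shuffle-based inclusive scan turns the per-city weights into a cumulative distribution, and the ballot selects the first lane whose running sum crosses the random target, which is precisely a roulette-wheel draw without any serial loop.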

The article is structured as follows. Section 2 contains a review of ACO and the CUDA architecture. Section 3 presents the parallelization techniques we use to improve the execution of ACO on NVIDIA GPUs. Section 4 describes the hardware and software environments before an in-depth analysis is carried out in Section 5. Section 6 summarizes related work relevant to this topic, and finally Section 7 concludes the paper, highlighting the main conclusions and several proposals for future work.

Section snippets

Ant Colony Optimization (ACO) for the Traveling Salesman Problem (TSP)

We refer to our previous descriptions of ACO in [6]. The well-known Traveling Salesman Problem (TSP) [28] consists of finding the shortest route that visits each of the nodes (nodes = cities) exactly once and returns to the origin node. The usual way to represent the symmetric TSP on n cities is with a complete weighted graph G with n nodes (“cities”). Each edge e_{i,j} carries a weight, and the distance d_{i,j} between cities i and j is the same in both directions. The …
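
In symbols (our restatement of the standard formulation, consistent with the description above), the symmetric TSP asks for a permutation \pi of the n cities minimizing the tour length

```latex
L(\pi) = \sum_{i=1}^{n-1} d_{\pi(i),\,\pi(i+1)} + d_{\pi(n),\,\pi(1)},
\qquad d_{ij} = d_{ji} \ \ \text{for all } i, j.
```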

Parallelization strategies

This section briefly introduces several parallel designs for solving the TSP with the Ant System (AS). We start from our previous description given in [6]. Algorithm 1 outlines the main AS structure as applied to the TSP. First, some data structures are initialized, such as the number of cities, the distance matrix, and so on. After that, the two main stages of ACO are performed (i.e., tour construction and pheromone update). Indeed, these two functions perform the …
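
The snippet cuts Algorithm 1 short; as a rough orientation only, the host side of such an AS run typically alternates the two stages as kernels. The kernel names, launch configuration, and buffer layout below are our assumptions, not the paper’s code:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the two AS stages discussed in the text.
__global__ void tourConstruction(const float* dist, const float* pher,
                                 int* tours, int nCities);
__global__ void pheromoneUpdate(const int* tours, float* pher,
                                int nCities, int nAnts);

// Illustrative host loop mirroring the structure sketched in Algorithm 1:
// initialize data structures, then alternate tour construction and
// pheromone update until the iteration budget is exhausted.
void antSystem(const float* d_dist, float* d_pher, int* d_tours,
               int nCities, int nAnts, int nIters)
{
    for (int it = 0; it < nIters; ++it) {
        // One ant per 32-thread warp, as in the paper's warp-level design.
        tourConstruction<<<nAnts, 32>>>(d_dist, d_pher, d_tours, nCities);
        pheromoneUpdate<<<(nCities * nCities + 255) / 256, 256>>>(d_tours, d_pher,
                                                                  nCities, nAnts);
    }
    cudaDeviceSynchronize();
}
```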

Experimental setup

This section briefly introduces the hardware and software environment in which the experiments shown in Section 5 are carried out.

Experimental results

This section presents an in-depth account of the experimental results of our designs on several NVIDIA GPU architectures. The experiments below cover performance and solution quality for the two main stages of ACO: a performance evaluation across different GPU generations, a study of memory contention, a comparison of the solution quality obtained by the sequential counterpart versus the atomic and non-atomic versions, and an evaluation of atomic operations across GPU generations.
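
To make the atomic-versus-non-atomic comparison concrete, the sketch below shows one plausible atomic formulation of pheromone deposition; the buffer layout (row-major n×n pheromone matrix, closed tours of length n+1) and all names are our assumptions, not the paper’s code. `atomicAdd` on `float` in global memory is available from Fermi (compute capability 2.0) onward, so it covers all the GPU generations studied:

```cuda
// Hedged sketch of atomic pheromone deposition: each thread walks one ant's
// tour and deposits delta = Q / tourLength on every edge the ant used.
// atomicAdd resolves collisions when two ants traverse the same edge.
__global__ void pheromoneDeposit(const int* tours,        // nAnts x (nCities+1) tours
                                 const float* tourLength, // one length per ant
                                 float* pher,             // nCities x nCities matrix
                                 int nCities, int nAnts, float Q)
{
    int ant = blockIdx.x * blockDim.x + threadIdx.x;
    if (ant >= nAnts) return;

    float delta = Q / tourLength[ant];
    const int* tour = tours + ant * (nCities + 1);  // closed tour: tour[n] == tour[0]
    for (int s = 0; s < nCities; ++s) {
        int i = tour[s], j = tour[s + 1];
        atomicAdd(&pher[i * nCities + j], delta);   // symmetric TSP: mirror the edge
        atomicAdd(&pher[j * nCities + i], delta);
    }
}
```

When many ants share popular edges the atomics serialize, which is why the contention behavior of this scheme is worth evaluating across GPU generations, as the section does.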

Parallel implementations

Thomas Stützle [46] presents a straightforward ACO parallelization in which several executions run independently on several processors. There is no communication overhead between the parallel executions, and thus the final solution is the best one obtained among the independently executed instances. Enhancements can be applied to parallel runs that do not communicate through information exchange between processors. Middendorf and Michel in [36] …

Conclusions and future work

Nature-inspired metaheuristics such as Ant Colony Optimization have been successfully applied to many NP-complete optimization problems. We conclude that the coarse-grained parallelism implemented in previous designs does not fit well on GPU architectures, and thus, to overcome this problem, different parallelization strategies are analyzed. First, we propose an agnostic vectorization of the tour construction stage of ACO, specifically designed for massively parallel architectures; …

Acknowledgments

This work is jointly supported by a travel grant from the EU FP7 NoE HiPEAC IST-217068, the European Network of Excellence on High Performance and Embedded Architecture and Compilation, by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 18946/JLI/13, and by the Spanish MEC and European Commission FEDER under grant TIN2016-78799-P (AEI/FEDER, UE). We also thank NVIDIA for hardware donation under GPU Educational Center 2014–2016 and Research Center …

References (54)

  • C.P. Chen, et al., Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Inform. Sci. (2014).
  • L. Chen, H.-Y. Sun, S. Wang, Parallel implementation of ant colony optimization on MPP, in: Machine Learning and … (2008).
  • L. Dawson, et al., Improving Ant Colony Optimization performance on the GPU using CUDA.
  • R.H. Dennard, et al., Design of ion-implanted MOSFET’s with very small physical dimensions, IEEE J. Solid-State Circuits (1974).
  • G. Dongdong, G. Guanghong, H. Liang, L. Ni, Application of multi-core parallel ant colony optimization in target …
  • M. Dorigo, Optimization, learning and natural algorithms (1992).
  • M. Dorigo, et al., Ant colony optimization, IEEE Comput. Intell. Mag. (2006).
  • M. Dorigo, G. Di Caro, Ant colony optimization: a new meta-heuristic, in: Evolutionary Computation, 1999, CEC 99 …
  • M. Dorigo, V. Maniezzo, A. Colorni, The ant system: Optimization by a colony of cooperative agents, …
  • M. Dorigo, et al., Ant colony optimization: overview and recent advances.
  • R. Eberhart, et al., A new optimizer using particle swarm theory.
  • A. Geist, et al., A survey of high-performance computing scaling challenges, Int. J. High Perform. Comput. Appl. (2015).
  • D.E. Goldberg, et al., Genetic algorithms and machine learning, Mach. Learn. (1988).
  • G.D. Guerrero, et al., Comparative evaluation of platforms for parallel Ant Colony Optimization, J. Supercomput. (2014).
  • N. Hardavellas, et al., Toward dark silicon in servers, IEEE Micro (2011).
  • J. Jeffers, et al., Intel Xeon Phi Coprocessor High-Performance Programming (2013).
  • W. Jiening, et al., Implementation of ant colony algorithm based on GPU.

    José M. Cecilia received his B.S. degree in Computer Science from the University of Murcia (Spain, 2005), his M.S. degree in Computer Science from Cranfield University (United Kingdom, 2007), and his Ph.D. degree in Computer Science from the University of Murcia (Spain, 2011). He was a predoctoral researcher at Manchester Metropolitan University (United Kingdom, 2010), supported by a collaboration grant from the European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC), and a visiting professor at the IMPACT group headed by Professor Wen-Mei Hwu at the University of Illinois (Urbana, IL, USA). He has published several papers in international peer-reviewed journals and conferences. His research interests include heterogeneous architectures as well as bio-inspired algorithms for evaluating the newest frontiers of computing. He also works on applying these techniques to challenging problems in the fields of Science and Engineering. He is currently an Associate Professor in the Computer Science Department at the Catholic University of Murcia, where he teaches several courses, including Introduction to Parallel Computing, Object-Oriented Programming, Operating Systems, Computer Architecture, and Computer Graphics, all of them part of the Computer Science degree.

    Antonio Llanes received his B.S. and M.S. degrees in Computer Science from the University of Murcia, Murcia, Spain, in 2006 and 2010, respectively. He received his Ph.D. from the Catholic University of Murcia, Murcia, Spain, in 2016, where he has been an assistant professor since 2006. He has been involved in several regional and international projects, such as SENECA and NILS mobility grants. His main research interests are parallel computing, AI, and bioinformatics applications.

    José L. Abellán is an assistant professor with the Computer Science Department, Universidad Catolica de Murcia (UCAM), Murcia, Spain. He received the B.S., M.S., and Ph.D. degrees in Computer Science from the University of Murcia, Murcia, Spain, in 2007, 2008 and 2012, respectively. From 2012 to 2014, he was a post-doctoral researcher at Photonics Center, Boston University, Boston, MA, USA. He is author of about 30 papers in refereed conferences and journals. He has served as a committee member in several international conferences. His research interests include intra/inter-chip networks and memory hierarchy designs for CPU and GPGPU platforms, silicon-photonic link technology, and HW/SW-acceleration for Deep Learning.

    Juan Gómez-Luna received B.S. and M.S. degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and the Ph.D. degree in Computer Science from the University of Córdoba, Spain, in 2012. Between 2005 and 2017, he was a lecturer at the University of Córdoba. In 2017 he joined the Systems group at ETH Zürich. His research interests focus on parallel computing and heterogeneous computing.

    Li-Wen Chang is a Software Engineer in the Microsoft AI and Research Group. He received the B.S. degree in Electrical Engineering from National Taiwan University, Taiwan, in 2007, and the M.S. and Ph.D. degrees in Electrical and Computer Engineering from University of Illinois at Urbana–Champaign, IL, USA, in 2014 and 2017, respectively. His research interests include optimization for compilers, parallel computing, heterogeneous computing, and high-performance computing for deep learning.

    Wen-Mei W. Hwu received the Ph.D. degree in computer science from the University of California, Berkeley, 1987. He is the Walter J. (“Jerry”) Sanders III-Advanced Micro Devices endowed chair of electrical and computer engineering at the University of Illinois at Urbana–Champaign. His research interests include the areas of architecture, implementation, software for high-performance computer systems, and parallel processing. He is a principal investigator (PI) for the petascale Blue Waters system, a codirector of the Intel and Microsoft funded Universal Parallel Computing Research Center (UPCRC), and PI for the world’s first NVIDIA CUDA Center of Excellence. He is the chief scientist of the Illinois Parallel Computing Institute and the director of the IMPACT lab. For his contributions to the areas of compiler optimization and computer architecture, he received the 1993 Eta Kappa Nu Outstanding Young Electrical Engineer Award, the 1994 University Scholar Award of the University of Illinois, the 1997 Eta Kappa Nu Holmes MacDonald Outstanding Teaching Award, the 1998 ACM SigArch Maurice Wilkes Award, the 1999 ACM Grace Murray Hopper Award, the 2001 Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the 2006 most influential ISCA paper award, and the University of California, Berkeley distinguished alumni in computer science award. From 1997 to 1999, he was the chairman of the Computer Engineering Program at the University of Illinois. In 2007, he introduced a new engineering course in massively parallel processing with David Kirk of NVIDIA. He is a fellow of IEEE and of the ACM.
