Efficient 2D and 3D watershed on graphics processing unit: block-asynchronous approaches based on cellular automata

https://doi.org/10.1016/j.compeleceng.2013.04.020

Highlights

  • We present a cellular automata based watershed that exploits the GPU architecture.

  • We propose a block-asynchronous strategy that maps the cellular automata on the GPU.

  • The artifacts produced by the asynchronous updating scheme are corrected.

  • The block-asynchronous method is tuned to be applied to 2D and 3D images.

  • The block-asynchronous strategy is also adequate for multicore architectures.

Abstract

The watershed transform is a method for non-supervised image segmentation. In this paper we show that a watershed algorithm based on a cellular automaton is a good choice for recent GPU architectures, especially when the synchronization rules are relaxed. In particular, we propose a block-asynchronous computation strategy that maps the cellular automaton on the thread blocks of the GPU. This method reduces the number of global synchronization points, allowing efficient exploitation of the memory hierarchy of the GPU. We also avoid the artifacts that the block-asynchronous updating scheme produces in the watershed lines by correcting the data propagation speed among the blocks. The proposals are compared to an OpenMP multithreaded code. The high speedups indicate the potential of this kind of algorithm for new architectures based on hundreds of cores. The method is also tuned for 3D volumes, obtaining similar results.

Introduction

The watershed transform is a non-supervised region-based segmentation tool for digital images. The idea behind this method comes from geography. A grey scale image can be represented as a topographic relief, where the height of each pixel is directly related to its grey level. The dividing lines of the catchment basins for precipitation falling over the region are called watershed lines [1]. Various definitions, algorithms and proposals can be found in the literature but, in practice, they can be classified into two groups: those based on the specification of a recursive algorithm by Vincent and Soille [1], and those based on the distance functions defined by Meyer [2].

One of the main advantages of the watershed transform is that all regions of the image are well defined at the end of the segmentation process, even if the contrast of the image is poor. For this reason it has been widely used in image processing, e.g., for medicine and biology [3]. Applying the watershed algorithm directly yields over-segmented results owing to the large number of regions detected [4]. This problem is overcome by preprocessing the image to reduce the number of regions, e.g., by applying a filter to improve the image contrast, or by computing a marker-controlled watershed transform that preselects the regions of interest [4]. The computational cost of these tasks adds to the already high computational cost of the watershed segmentation itself.

Different sequential algorithms have been designed to compute the watershed transform using sequential structures such as queues [1] or graphs [5] to simulate the flooding process. Due to the recursive nature of the watershed transformation, its parallelization is not a trivial task. Two early algorithms for computing the watershed transformation on parallel computers were developed by Moga et al. [6]. Both algorithms start by detecting the regional minima, and then the image is lower-complete transformed and represented as a graph or forest. Two methods of propagating the labels of minima are proposed and compared. These algorithms are implemented using the message passing paradigm and executed on multiprocessor systems. Moreover, the wavefront technique was introduced in this work. It is used to propagate the labels from the border to the interior of a flat zone, so that the propagation reaches the middle of the flat zone at the same time. In the field of parallel processing, most efforts have focused on multiprocessors [6] and to a lesser extent on specific architectures, such as field-programmable gate arrays (FPGAs). A compendium of algorithms and parallelization strategies for watershed is presented in [7].
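The wavefront idea can be illustrated with a breadth-first propagation over a flat zone. The following is a minimal sketch, not the implementation of Moga et al. [6]: it assumes 4-connectivity and propagates a distance rather than a label, and the function name `plateau_distance` is hypothetical. Pixels with a strictly lower neighbor seed the wavefront, which then advances one ring at a time into the plateau.

```python
from collections import deque


def plateau_distance(image):
    """Breadth-first wavefront over flat zones (illustrative sketch).

    Pixels with a strictly lower 4-neighbour get distance 1. The front then
    advances through pixels of equal grey level, so each plateau pixel gets
    its distance to the nearest descending border. Plateaus without a lower
    border (regional minima) keep distance 0.
    """
    rows, cols = len(image), len(image[0])
    dist = [[0] * cols for _ in range(rows)]
    queue = deque()

    def neighbours(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield nr, nc

    # Seed the wavefront with every pixel that has a strictly lower neighbour.
    for r in range(rows):
        for c in range(cols):
            if any(image[nr][nc] < image[r][c] for nr, nc in neighbours(r, c)):
                dist[r][c] = 1
                queue.append((r, c))

    # Advance the front into the interior of each flat zone.
    while queue:
        r, c = queue.popleft()
        for nr, nc in neighbours(r, c):
            if image[nr][nc] == image[r][c] and dist[nr][nc] == 0:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


# A plateau of height 2 between two minima: both fronts meet in the middle.
print(plateau_distance([[1, 2, 2, 2, 1]]))
```

Because the queue is processed in first-in, first-out order, the fronts started at opposite borders of the plateau advance at the same speed and meet at its middle.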

Graphics Processing Units (GPUs) have evolved into low-cost, multithreaded, multicore processors with enormous computational power, which are now common in PC hardware. The Compute Unified Device Architecture (CUDA) developed by NVIDIA, based on a data-parallel programming model, provides support for general-purpose computing on graphics hardware. In recent years, a number of parallel implementations of watershed algorithms on GPUs have been published [8], [9], [10], [11]. In particular, a watershed algorithm implemented on GPU using CUDA was presented by Körbes et al. [10]. The algorithm is a four-step procedure. Its efficiency is a result of the labeling method used, which greatly reduces the number of iterations required to complete the task. The algorithm requires several steps of synchronization and non-local data movement.

Cellular automata (CA) constitute a computing model that has been extensively used for artificial life, pattern recognition [12] or image processing [13]. The popularity of CA is mainly due to the simplicity of modeling complex problems with the help of local information only. CA are dynamical systems that consist of an n-dimensional array of cells [14], each of which can be in one of a finite number of possible states. They are synchronously updated in discrete time steps according to a local, identical interaction rule. This requires a strict order for the component updates: no cell can advance to the next time step until all other cells have completed the current one. The concept of parallelism is therefore implicit in CA and matches the computing model of GPUs, multicore and many-core systems, as it is based on the individual evolution of different cells using local information. This is the reason why different parallel algorithms based on CA for GPU and multicore systems have been developed for general purpose computing, visualization and image processing [15].
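The synchronous update can be sketched with two buffers, so that every cell reads only the previous generation. This is an illustrative sketch under stated assumptions: the 4-neighborhood, the function names `synchronous_step` and `min_rule`, and the minimum-propagation rule are chosen for the example and are not taken from the paper.

```python
def synchronous_step(state, rule):
    """One synchronous CA step: all cells read the old state, write a new one."""
    rows, cols = len(state), len(state[0])
    new_state = [row[:] for row in state]  # second buffer for the next generation
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                state[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            new_state[r][c] = rule(state[r][c], neighbours)
    return new_state


def min_rule(value, neighbours):
    """Example local rule: a cell takes the smallest value it can see."""
    return min([value] + neighbours)


grid = [[5, 5, 5, 0]]
step1 = synchronous_step(grid, min_rule)   # the 0 has moved one cell to the left
step2 = synchronous_step(step1, min_rule)  # and one more cell after a second step
print(step1, step2)
```

With this scheme information travels exactly one cell per step, which is why a synchronous automaton needs one global synchronization for every cell of propagation distance.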

Watershed implementations based on CA were proposed using shaders in the graphic pipeline of the GPU [8]. This implementation is synchronous as it updates the entire image at each step of the evolution of the automaton. The authors developed an algorithm using image integration via the Ford-Bellman shortest paths algorithm. From each minimum of the image, a wavefront is started, labeled by the index of the minimum it started from, and the distance is initialized with the value of the minimum. The transition rule must be synchronously applied to all the cells. This algorithm is simple and fast, but it requires a previous detection of minima by means of another method. Galilée et al. [16] introduced a parallel algorithm-architecture based on asynchronous CA to compute the watershed transform, updating each pixel as soon as the information from the neighbors becomes available.
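The synchronous scheme described above can be sketched as a Bellman-Ford style relaxation. The exact transition rule of [8] is not given here, so the sketch makes explicit assumptions: the cost of entering a pixel is its own grey value (one reading of "image integration"), grey values are non-negative, connectivity is 4-neighbor, and the minima are supplied by the caller. The name `ca_watershed_sync` is hypothetical.

```python
INF = float("inf")


def ca_watershed_sync(image, minima):
    """Label pixels by synchronous shortest-path relaxation (illustrative).

    image  : 2D list of non-negative grey values.
    minima : dict mapping (row, col) of each given minimum to its label.
    Assumed rule: dist(u) = min over neighbours v of dist(v) + image(u).
    """
    rows, cols = len(image), len(image[0])
    dist = [[INF] * cols for _ in range(rows)]
    label = [[0] * cols for _ in range(rows)]
    for (r, c), lab in minima.items():
        dist[r][c] = image[r][c]  # distance starts at the value of the minimum
        label[r][c] = lab

    changed = True
    while changed:
        changed = False
        # Double buffering: every cell reads the previous generation only.
        new_dist = [row[:] for row in dist]
        new_label = [row[:] for row in label]
        for r in range(rows):
            for c in range(cols):
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols:
                        candidate = dist[nr][nc] + image[r][c]
                        if candidate < new_dist[r][c]:
                            new_dist[r][c] = candidate
                            new_label[r][c] = label[nr][nc]
                            changed = True
        dist, label = new_dist, new_label
    return label


# Two minima at the ends of a one-row image; the ridge pixel (value 5)
# is claimed by the basin that reaches it with the smaller accumulated cost.
print(ca_watershed_sync([[1, 3, 5, 2, 0]], {(0, 0): 1, (0, 4): 2}))
```

Each pass of the `while` loop corresponds to one global step of the automaton, and the loop stops when no cell changes.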

In general, the asynchronous computing model is suitable for parallel computing where a problem is split into independent subproblems, each one solved by a different processor, minimizing the number of interprocessor communications that imply synchronization points. This data-parallel model is used by CUDA. In the CUDA parallel model, a multithreaded program is partitioned into blocks of threads that run independently from each other [17]. The order in which blocks are computed is not preset, so communication among them must be performed by means of synchronization points, which are costly. We have therefore found that the asynchronous algorithm described in [16] is particularly suitable for the CUDA computing model, as different regions of the image can be simultaneously and independently updated during a certain number of steps, thus reducing the number of synchronization points. Taking into account that multi-CPU computers also benefit from a high computation-to-communication ratio, research on algorithms that can be partitioned into blocks and asynchronously computed is relevant.
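The effect of updating regions independently for several steps can be sketched on the CPU. This is a simplified illustration, not the paper's GPU algorithm: blocks are visited sequentially, cells inside a block are updated in place, values outside the block are read from a snapshot taken at the last global synchronization, and a minimum-propagation rule stands in for the watershed rule. The name `block_async_min` and its parameters are hypothetical.

```python
def block_async_min(state, block, local_steps):
    """Propagate the minimum over a grid with block-wise local iterations.

    Each block runs `local_steps` updates using fresh values inside the block
    and stale (snapshot) values from outside it, so the result does not
    depend on the order in which blocks are visited.
    Returns (final_state, number_of_global_synchronizations).
    """
    rows, cols = len(state), len(state[0])
    state = [row[:] for row in state]
    syncs = 0
    changed = True
    while changed:
        changed = False
        snapshot = [row[:] for row in state]  # what other blocks are allowed to see
        for br in range(0, rows, block):
            for bc in range(0, cols, block):
                r_end, c_end = min(br + block, rows), min(bc + block, cols)
                for _ in range(local_steps):
                    for r in range(br, r_end):
                        for c in range(bc, c_end):
                            best = state[r][c]
                            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                                if not (0 <= nr < rows and 0 <= nc < cols):
                                    continue
                                inside = br <= nr < r_end and bc <= nc < c_end
                                value = state[nr][nc] if inside else snapshot[nr][nc]
                                best = min(best, value)
                            if best < state[r][c]:
                                state[r][c] = best
                                changed = True
        syncs += 1  # one global synchronization per outer pass
    return state, syncs


grid = [[9, 9, 9, 9, 9, 9, 9, 0]]
print(block_async_min(grid, 4, 4))  # blocks of 4 cells, 4 local steps each
print(block_async_min(grid, 1, 1))  # one cell per block: fully synchronous
```

With one cell per block and one local step, every neighbor is read from the snapshot, which reproduces the synchronous scheme. On this small grid the block-wise run needs 3 global synchronizations, while the synchronous one needs 8.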

In this paper we present a block-asynchronous computation method for the watershed algorithm based on CA (CA-watershed) defined by [16]. We first study two proposals based on 2D CA, thus applied to 2D images, and finally they are extended to the case of 3D volumes. The proposals are called block-asynchronous and artifact-free block-asynchronous. An early version for 2D images was published in [18] and applied to hyperspectral images in [19].

We compare our GPU watershed proposals to an efficient multithreaded OpenMP implementation of the watershed on the CPU. When the block-asynchronous CA-watershed algorithm is computed, the quality of the segmentation is slightly affected. In particular, data propagation at the block boundaries causes undesirable artifacts. The artifact-free block-asynchronous algorithm applies a technique known as wavefront [6], which increases the quality of the watershed lines obtained.

This paper is organized as follows: Section 2.1 introduces the watershed transform and Section 2.2 the main concepts associated with CA. The watershed algorithm based on CA is described in Section 2.3. In Section 3 we present the GPU architecture and the CUDA programming model. The different GPU algorithms are described in Section 4 and the results obtained are discussed in Section 5. Finally, Section 6 presents the conclusions.

Section snippets

Watershed based on cellular automata

In this section we introduce the watershed transform and the CA principles, and describe a watershed algorithm based on CA.

First, we introduce a few concepts and notations regarding topography in order to continue with the watershed transform. A grey scale image may be considered as a graph G = (V, A) with a finite set V of vertices (pixels) and a set of arcs A ⊆ V × V defining the connectivity. Two pixels u and v are connected if (u, v) ∈ A. The pixels connected to u, called neighbors, are denoted by N

GPU architecture

GPUs provide massively parallel processing capabilities based on a data-parallel architecture. There are Application Programming Interfaces (APIs) for writing programs that are executed in the GPU, such as CUDA for NVIDIA devices, or OpenCL, for heterogeneous platforms.

A CUDA program, which is called a kernel, is executed by thousands of threads grouped into blocks. The blocks are arranged into a grid and scheduled to any of the available GPU cores which enables automatic scalability for future

Block-asynchronous GPU algorithm

In this section we present the GPU algorithm for the CA-based watershed introduced in Section 2.3. We first explain the method for a 2D automaton and then apply it to 2D images. A synchronous implementation of the watershed algorithm on the GPU is presented in Section 4.1. Then we develop a more efficient block-asynchronous GPU algorithm in Section 4.2. As a consequence of the computation by blocks, the block-asynchronous watershed algorithm presents some undesirable artifacts that are

Results

We have evaluated our proposals on a PC with an Intel Core i7 with four cores at 2.80 GHz and 8 GB of RAM. Each core has separate L1 caches for instructions and data, and a unified L2 cache. The unified L3 cache is common to all the cores, as shown in Table 1. The GPU is an NVIDIA GeForce GTX580, consisting of 16 SMs, each one with 32 SPs. The GPU memory size is 1536 MB and its cache architecture consists of a unified L1 cache per SM and a unified L2 cache of 768 KB shared by all the SMs. The L1 cache

Conclusions

A block-asynchronous strategy to compute the cellular automata based watershed on the GPU was studied. The implicit parallelism of CA, consisting of independent cells that evolve following a set of rules, perfectly matches the computing model of modern GPUs. The block-asynchronous proposal relaxes the synchronization requirements, thus exploiting the computing capabilities of GPUs to the maximum. Moreover, this implementation also matches the computing requirements of multi-CPU computers, and

Acknowledgments

This work was supported in part by the Ministry of Science and Innovation, Government of Spain, co-funded by the FEDER funds of the European Union, under contract TIN 2010-17541, and by Xunta de Galicia, Program for Consolidation of Competitive Research Groups ref. 2010/28. Pablo Quesada-Barriuso acknowledges financial support from the Ministry of Science and Innovation, Government of Spain, under a MICINN-FPI grant.

Pablo Quesada-Barriuso received his B.S. in Computer Science and his M.S. in Graphics, Games and Virtual Reality in 2007 and 2010, respectively. He joined the Computer Architecture Group of the University of Santiago de Compostela as a research assistant, and he is currently pursuing a Ph.D. in Computer Science. His main research interests include image processing, parallel algorithms and GPUs.

References (25)

  • F. Meyer. Topographic distance and watershed lines. Signal Process (1994)
  • F. Meyer et al. Morphological segmentation. J Visual Commun Image Represent (1990)
  • K.A. Hawick et al. Parallel graph component labelling with GPUs and CUDA. Parallel Comput (2010)
  • L. Vincent et al. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans Pattern Anal Mach Intell (1991)
  • V. Grau et al. Improved watershed transform for medical image segmentation using prior information. IEEE Trans Med Imag (2004)
  • A. Bieniek et al. A connected component approach to the watershed segmentation
  • A.N. Moga et al. Parallel watershed transformation algorithms for image segmentation. Parallel Comput (1998)
  • J.B.T.M. Roerdink et al. The watershed transform: definitions, algorithms and parallelization strategies. Fundam Inf (2000)
  • Kauffmann C, Piche N. Cellular automaton for ultra-fast watershed transform on gpu. In: Proc. of the 19th int. conf. on...
  • Wagner B, Müller P, Haase G. A parallel watershed-transformation algorithm for the GPU. In: Proc. of the workshop on...
  • A. Körbes et al. Advances on watershed processing on GPU architecture. In: Mathematical morphology and its applications to image and signal processing (2011)
  • Hučko M, Šrámek M. Streamed watershed transform on GPU for processing of large volume data. In: Proceedings of spring...
    Dora B. Heras received an M.S. degree in Physics in 1994 and a Ph.D. in 2000. She is currently an Associate Professor in the Department of Electronics and Computer Engineering at the University of Santiago de Compostela. Her research interests include parallel and distributed computing, software optimization techniques for emerging architectures, computer graphics for high performance computing and image processing.

    Francisco Argüello received the B.S. and Ph.D. degrees in Physics from the University of Santiago, Spain in 1988 and 1992, respectively. He is currently an associate professor in the Department of Electronic and Computer Engineering at the University of Santiago, Spain. His current research interests include signal and image processing, computer graphics, parallel and distributed computing, and quantum computing.

    Reviews processed and recommended for publication to Editor-in-Chief by Guest Editor Prof. Jesus Carretero.
