A multi-GPU biclustering algorithm for binary datasets

https://doi.org/10.1016/j.jpdc.2020.09.009

Highlights

  • gBiBit is the first binary biclustering method that can process large datasets.

  • gBiBit development includes a novel methodology to take full advantage of the GPU device resources.

  • gBiBit solves the memory issues in GPU technology when large datasets must be processed.

  • gBiBit also has a multi-GPU version that allows the use of multiple GPU devices in a balanced way.

  • gBiBit source code is available at https://github.com/aureliolfdez/gbibit.

Abstract

Graphics Processing Unit (GPU) technology and the CUDA architecture are among the most used options for adapting machine learning techniques to the huge amounts of complex data that are currently generated. Biclustering techniques are useful for discovering local patterns in datasets, and those implemented to exploit GPU resources in parallel have improved their computational performance. However, this does not guarantee that they can successfully process large datasets: important issues must be taken into account, such as the data transfers between CPU and GPU memory or the balanced distribution of the workload among the GPU resources. In this paper, a GPU version of one of the fastest biclustering solutions, BiBit, is presented. This implementation, named gBiBit, has been designed to take full advantage of the computational resources offered by GPU devices. Either using a single GPU device or in its multi-GPU mode, gBiBit is able to process large binary datasets. The experimental results show that gBiBit improves on the computational performance of BiBit, of a parallel CPU version (ParBiBit) and of an early GPU version (CUBiBit). gBiBit source code is available at https://github.com/aureliolfdez/gbibit.

Introduction

Nowadays, huge amounts of data are being generated and collected at an unprecedented speed due to technological advances in information technologies [8]. In this context, there is a need to adapt data analysis approaches and computational methods to support the volume, variety, velocity and veracity of the data that are currently being generated, transmitted and analyzed [43]. In particular, Machine Learning (ML) algorithms are used to discover significant, non-trivial and useful patterns in datasets from areas as diverse as genetics and genomics [19], energy consumption [6], marketing [36] and social networks [11].

One of the most used ML approaches is clustering [40], whose aim is to obtain groups (clusters) of elements that share a set of common properties. However, the results of applying standard clustering methods to certain datasets are limited. For example, in gene expression datasets, clustering techniques can be used to group either genes or experimental conditions. Nevertheless, many behavior patterns are common to a group of genes only under specific experimental conditions [23]. Thus, in order to extract such local patterns, a different type of grouping technique, able to perform simultaneous row–column clustering, is necessary: biclustering. Biclustering was originally introduced in the 1970s by Hartigan [18]. It is an NP-hard problem whose search space is composed of all the possible overlapping subsets of elements that share a common subset of properties [24]. Although most of the biclustering approaches proposed in the literature are based on efficient heuristics, considerable effort has recently been devoted to adapting them to the current scientific and technological environment, in which huge volumes of complex data coming from different sources are generated [13], [29].
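To make the notion of a local row–column pattern concrete, consider a small binary matrix; a bicluster is then a subset of rows that share the value 1 on a common subset of columns. The following toy example (the matrix and indices are invented for exposition, not taken from the paper) shows a pattern that a global clustering of whole rows could miss:

```python
# A toy 4x5 binary dataset: rows could be genes, columns conditions.
data = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 1],
]

# Rows 0, 1 and 3 all have a 1 in columns 0, 2 and 3, so
# (rows {0, 1, 3}, columns {0, 2, 3}) is a bicluster: the rows agree
# only on that column subset, even though they differ elsewhere.
rows, cols = [0, 1, 3], [0, 2, 3]
assert all(data[r][c] == 1 for r in rows for c in cols)
```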

To deal with this challenge, High Performance Computing (HPC) techniques are used to take advantage of all available hardware and software resources, creating parallel and distributed computing strategies to solve problems that involve huge volumes of data and a high computational cost [7], [10], [12]. For example, Apache Hadoop [39], inspired by the MapReduce model published by Google, was one of the first solutions to handle this kind of problem, dividing a huge task into a large number of small subtasks [16]. Spark [42] is a fast engine for large-scale data processing and has become one of the main processing tools for massive datasets [35]. Besides, general-purpose computing on Graphics Processing Units (GPU) and its most common programming model, CUDA [27], is one of the most used HPC models for massive data processing [17], [37].

In this context, several biclustering techniques have adopted GPU and CUDA to parallelize the generation of biclusters [15], [22], [30]. Although this technology improves the computational performance of biclustering techniques, it also involves implementation issues that must be overcome to take advantage of the GPU devices, such as the balanced distribution of tasks among the available resources or the transfer of data between the different types of memory (see Section 2). Therefore, strategies must be designed to manage the GPU resources intelligently and to use multiple GPU devices (multi-GPU), in order not only to develop a faster algorithm but also to allow the processing of large datasets.
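The balance problem can be illustrated on the CPU side. A pairwise biclustering kernel must evaluate every row pair (i, j) with i < j; a naive contiguous split of the outer index i gives early devices far more pairs than late ones, whereas dealing pairs out by a flattened pair index keeps the devices even. The sketch below is a generic illustration of this idea, not gBiBit's actual scheduling code (the function name and device count are hypothetical):

```python
from itertools import combinations

def split_pairs(n_rows, n_devices):
    """Distribute all row pairs (i, j), i < j, across devices so that
    every device receives nearly the same number of pairs."""
    buckets = [[] for _ in range(n_devices)]
    for k, pair in enumerate(combinations(range(n_rows), 2)):
        buckets[k % n_devices].append(pair)  # round-robin by pair index
    return buckets

# 6 rows give C(6, 2) = 15 pairs; over 3 devices each gets exactly 5.
buckets = split_pairs(6, 3)
assert [len(b) for b in buckets] == [5, 5, 5]
```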

This paper presents the Bit-Pattern Biclustering GPU Algorithm (gBiBit), a multi-GPU version of the BiBit binary biclustering algorithm [32]. It is the first CUDA-based implementation of a binary biclustering algorithm that allows large datasets to be processed. gBiBit has been designed to exploit all GPU resources efficiently, improving the performance of BiBit, one of the fastest biclustering algorithms thanks to its use of bit-level operations. The experimental results show that gBiBit also outperforms a previous parallel CUDA version of BiBit, called CUBiBit [15]. Neither BiBit nor CUBiBit was able to process large datasets during the experimentation. gBiBit source code is available at https://github.com/aureliolfdez/gbibit.
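The bit-level idea behind BiBit's speed can be sketched briefly: each binary row is packed into a machine word, the bitwise AND of a pair of rows yields a candidate column pattern, and a bicluster collects every row that contains that pattern. The following is a simplified, single-threaded sketch of that idea, not the paper's CUDA implementation (the function name and the `min_cols` threshold are illustrative):

```python
from itertools import combinations

def bibit_sketch(rows, min_cols=2):
    """Simplified bit-pattern biclustering: `rows` are integers whose
    set bits mark the 1-columns of a binary matrix."""
    seen, biclusters = set(), []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]            # columns shared by the pair
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue                           # too narrow, or a duplicate
        seen.add(pattern)
        members = [k for k, r in enumerate(rows)
                   if r & pattern == pattern]  # rows containing the pattern
        biclusters.append((members, pattern))
    return biclusters

# Rows as bit masks: 0b1101 and 0b1111 share pattern 0b1101 (rows 0, 1);
# all three rows share pattern 0b0101.
result = bibit_sketch([0b1101, 0b1111, 0b0101])
```

Because the pattern test is a single AND plus a comparison per packed word, the per-pair work is tiny; the cost lies in the quadratic number of pairs, which is exactly the workload that gBiBit distributes across GPU threads and devices.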

Therefore, we can summarize the contributions of this work as follows:

  • gBiBit development includes a novel methodology to take full advantage of the resources of a GPU device.

  • gBiBit solves the memory issues in GPU technology when large datasets must be processed.

  • gBiBit also has a multi-GPU version that allows multiple GPU devices to be used in a balanced way.

The rest of the paper is structured as follows: Section 2 presents a brief review of related works. Section 3 describes the main features of the original algorithm, BiBit, and its new GPU version, gBiBit. The computational performance of gBiBit is tested in Section 4. Finally, Section 5 summarizes the main conclusions derived from this work.

Related work

The use of GPU technology is widely extended in multiple scientific areas such as Power Systems [44], Bioinformatics [38] or Medical Image Analysis [20]. Biclustering techniques adapted to be executed on GPU devices must face important challenges related to this technology and the CUDA architecture [25], [28], [31]. On the one hand, memory management involves making the right decision between using GPU shared memory (low latency and low storage capacity) or using GPU global memory (higher latency but much larger storage capacity).

Methods

In this section, the main features of the original BiBit algorithm are summarized. Next, its new GPU version, gBiBit, is fully explained.

Results

The objective of this section is to compare the performance and scalability of the gBiBit implementation with those of the original BiBit algorithm and the CUBiBit version. In addition, another version of BiBit, called ParBiBit [14], has been considered in this experimentation. ParBiBit is based on C++11 [41] and includes support for threads and Message Passing Interface (MPI) [9] processes in order to exploit the compute capabilities of modern distributed-memory systems.

To this aim, two types of datasets were used.

Conclusions

GPU technology and the CUDA architecture are among the most used options to adapt machine learning techniques to large and complex datasets. In the case of biclustering techniques, those that have been implemented to use GPU resources in parallel have improved their computational performance. However, this does not guarantee that these new versions of biclustering algorithms can handle the processing of large datasets, because there are some important issues that must be taken into account, such as the data transfers between CPU and GPU memory or the balanced distribution of the workload among the GPU resources.

CRediT authorship contribution statement

Aurelio Lopez-Fernandez: Methodology, Software, Validation, Investigation, Writing - original draft, Writing - review & editing. Domingo Rodriguez-Baena: Methodology, Supervision, Validation, Investigation, Writing - original draft, Writing - review & editing. Francisco Gomez-Vela: Supervision, Validation, Investigation, Writing - original draft, Writing - review & editing. Federico Divina: Validation, Writing - original draft, Writing - review & editing. Miguel Garcia-Torres: Validation,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Aurelio López Fernández is a Ph.D. student in Computer Science from the Pablo de Olavide University of Seville (within the Ph.D. program Biotechnology and Chemical Technology). His lines of research are mainly aimed at intelligent analysis of large datasets using Data Mining techniques in Big Data and HPC (High Performance Computing) environments.

References (44)

  • González-Domínguez, J., et al., Accelerating binary biclustering on platforms with CUDA-enabled GPUs, Inform. Sci. (2019)

  • Litjens, G., et al., A survey on deep learning in medical image analysis, Med. Image Anal. (2017)

  • Liu, B., et al., GPU-based biclustering for microarray data analysis in neurocomputing, Neurocomputing (2014)

  • Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin, Cell (2014)

  • Arnedo-Fdez, J., Zwir, I., Romero-Zaliz, R., Biclustering of very large datasets with GPU technology using CUDA, in: ...

  • Bhattacharya, A., et al., A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules, Sci. Rep. (2017)

  • Cheng, Y., Church, G.M., Biclustering of expression data, in: Proceedings of the Eighth International Conference on Intelligent...

  • The GTEx Consortium, The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans, Science (2015)

  • Divina, F., et al., Biclustering of smart building electric energy consumption data, Appl. Sci. (2019)

  • Duarte, R.P., Simões, Á., Henriques, R., Neto, H.C., FPGA-based OpenCL accelerator for discovering temporal patterns in...

  • Erevelles, S., et al., Big data consumer analytics and the transformation of marketing, J. Bus. Res. (2016)

  • The MPI Forum, MPI: A message passing interface (1993)

  • Foszner, P., Skurowski, P., Bi-cluster parallel computing in bioinformatics - performance and eco-efficiency, in: ...

  • Galán-García, P., et al., Supervised machine learning for the detection of troll profiles in Twitter social network: Application to a real case of cyberbullying, Logic J. IGPL (2016)

  • Gómez-Pulido, J.A., et al., Fine-grained parallelization of fitness functions in bioinformatics optimization problems: Gene selection for cancer classification and biclustering of gene expression data, BMC Bioinformatics (2016)

  • Gómez-Vela, F., et al., Bioinformatics from a big data perspective: Meeting the challenge

  • González-Domínguez, J., et al., ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems, PLoS One (2018)

  • Gowri, R., Rathipriya, R., MR-GABiT: Map reduce based genetic algorithm for biclustering time series data, in: 2016 IEEE...

  • Gregg, C., Hazelwood, K., Where is the data? Why you cannot debate CPU vs. GPU performance without the answer, in: IEEE...

  • Hartigan, J.A., Direct clustering of a data matrix, J. Amer. Statist. Assoc. (1972)

  • Libbrecht, M.W., et al., Machine learning applications in genetics and genomics, Nature Rev. Genet. (2015)

  • Liu, B., et al., Design exploration of geometric biclustering for microarray data analysis in data mining, IEEE Trans. Parallel Distrib. Syst. (2014)


Domingo Savio Rodríguez Baena received his Ph.D. in Computer Science from the Pablo de Olavide University of Seville (within the Ph.D. program Biotechnology and Chemical Technology), with a Cum Laude qualification, and is a Computer Science Engineer from the University of Seville. His lines of research are focused on the intelligent analysis of data by applying Machine Learning and data mining techniques. He has mainly focused on applying biclustering techniques to genetic and biomedical data. In addition, he has recently focused on adapting data mining algorithms to Big Data environments.

Francisco A. Gómez received his Ph.D. in Computer Science from the Pablo de Olavide University of Seville (within the Ph.D. program Biotechnology and Chemical Technology), with a Cum Laude qualification, and is a Computer Science Engineer from the University of Seville. He has more than 5 years of experience in private companies in the IT field from before he began his research. His lines of research are primarily aimed at the treatment of information using intelligent techniques, applying Machine Learning, Big Data and data mining techniques.

    Federico Divina obtained his Ph.D. in Artificial Intelligence from the Vrije Universiteit of Amsterdam, and after that he worked as a postdoc at the University of Tilburg, within the European project NEWTIES. In 2006 he moved to the Pablo de Olavide University. He has been working on knowledge extraction since his Ph.D. thesis at the Vrije Universiteit of Amsterdam. He has extensive experience in the application of Machine Learning, especially techniques based on Soft Computing, for the extraction of knowledge from massive data.

Miguel García Torres is an associate professor in the Escuela Politécnica Superior of the Universidad Pablo de Olavide. He received the BS degree in physics and the Ph.D. degree in computer science from the Universidad de La Laguna, Tenerife, Spain, in 2001 and 2007, respectively. After obtaining his doctorate he held a postdoc position in the Laboratory for Space Astrophysics and Theoretical Physics at the National Institute of Aerospace Technology (INTA). There, he joined the Gaia mission of the European Space Agency (ESA) and started to participate in the Gaia Data Processing and Analysis Consortium (DPAC) as a member of the "Astrophysical Parameters" Coordination Unit (CU8). He has been involved in the "Object Clustering Analysis" (OCA) Development Unit since then. His research interests include machine learning, metaheuristics, big data, time series forecasting, bioinformatics and astrostatistics.
