Future Generation Computer Systems

Volume 102, January 2020, Pages 1016-1026
BOOTABLE: Bioinformatics benchmark tool suite for applications and hardware

https://doi.org/10.1016/j.future.2019.09.057

Highlights

  • Implementation of a benchmark tool suite based on bioinformatics applications.

  • Convenient installation and usage design.

  • Resource consumption and scaling investigation of any desired tool.

  • Generic tool integration.

Abstract

Interest in analyzing biological data on a large scale has grown over recent years. Bioinformatics applications play an important role in the analysis of huge amounts of data. Due to the large volume of biological data and/or large problem spaces, a considerable amount of computing resources is required to answer the research questions raised. To estimate which underlying hardware might be most suitable for the bioinformatics tools applied, a well-defined benchmark suite is required. Such a benchmark suite is useful when purchasing hardware, and even more so for larger projects that aim to establish a bioinformatics compute infrastructure. With this paper we present BOOTABLE, our bioinformatics benchmark suite. BOOTABLE currently contains seven popular and widely used bioinformatics applications representing a broad spectrum of usage characteristics. It further includes an automated installation procedure and all required datasets, as well as functionality to test any desired application with regard to resource consumption and scaling behavior.

Introduction

The success of sequencing technologies has led to a large growth of genomics data on the scale of tera- and petabytes. At the same time, the demand for analytical methods that can analyze the created data on a large scale has increased. The research field of bioinformatics has become more and more important in providing algorithms, methods, and applications for researchers to examine the created data and extract the meaningful parts needed to answer biological questions. Handling and analyzing this huge amount of data requires a rather large amount of computing resources. Bioinformatics algorithms and applications mostly try to solve problems that are NP-hard or harder [1]. To solve such problems in a reasonable time, or to find a sufficient approximation, large computing resources are necessary. Other research topics within or around bioinformatics also benefit from large available computing resources, such as molecular dynamics simulation in computational chemistry or machine learning methods. These fields likewise handle large datasets, albeit not in the range of petabytes, but the calculations for the simulation and training tasks are very complex and benefit from large computing resources. No matter which discipline in bioinformatics or related fields one focuses on, there is a large demand for computer-aided methods requiring computational resources. The required resources can be provided by a commercial computing cloud like [2], [3], [4], by academic clouds like the de.NBI cloud [5] or bwCloud [6], or by a hybrid one like Helix Nebula [7]. However the computational resources are provided, at some point it has to be decided which kind of hardware to procure.

Usually the required computational hardware has to be purchased, or a choice has to be made between already existing hardware options. To decide which hardware should be used, it makes sense to test your common bioinformatics applications, or at least tools from the same domain, such as sequence analysis tools [8], protein folding prediction tools [9], or molecular dynamics simulation tools [10]. Questions raised could include: How robust is the underlying hardware? How fast can it deliver the application results? Is the number of compute cores more important than the clock rate? How much Random-Access Memory (RAM) is required? All these and many more questions need to be considered in order to purchase and use computing hardware in an efficient and cost-saving way.

To test different hardware you need some kind of test environment consisting of a range of tools and suitable datasets. All these base elements combined lead to a so-called benchmarking suite that handles all the measurements. In order to get comparable results, a benchmark suite has to fulfill some constraints. First, every tested tool needs to be installed in exactly the same version and in the same way. Second, if a tool needs to be compiled, the same compiler with the same parameters must be used, to exclude any variation and therefore deviations in the measured values. In addition to the applications, the data used to run the desired tools also has to be the same; moreover, the datasets have to be chosen carefully, because their size or complexity has a direct impact on the runtime. Summarizing all these requirements, we are talking about a reproducible deployment that guarantees a minimum standard of equality. Such a benchmark tool would have applications in different real-world scenarios aside from a hardware procurement process. A choice of examples is presented in the following.
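One minimal way to enforce the "same data" requirement is to record a checksum for every input file once and verify it before each benchmark run. The following sketch illustrates the idea; the file paths and manifest are hypothetical and not part of BOOTABLE itself:

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical manifest: dataset file -> digest recorded when the suite was built.
EXPECTED = {
    "datasets/reads.fastq": "…",  # placeholder, fill in the recorded digest
}

def verify_datasets(expected):
    """Abort the benchmark if any input file differs from its recorded digest."""
    for path, digest in expected.items():
        if sha256sum(path) != digest:
            raise RuntimeError(f"{path}: checksum mismatch, benchmark inputs differ")
```

The same principle extends to pinning tool versions and compiler flags: any drift in the inputs invalidates comparisons between measured values.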

In order to buy and test hardware you normally contact a hardware vendor you trust. The hardware vendor or reseller, in most cases, knows little about the applications you use: how to install them, which datasets they require, or how to execute them. Furthermore, hardware vendors might have different opinions than a researcher about which benchmark results are meaningful for the performance of the tools in use. Of course you could explain to them how to install and run the applications and what you would like to know in the end, but this can get time-consuming.

Another point, from the perspective of the hardware operator, is a focus on efficient utilization of the underlying hardware, in the sense that requested resources are used wisely by the user and not wasted, thereby not blocking other users from using them. In order to give users an estimate of which amount of resources makes sense, the operator needs measurements of the tools in use. For large commercial providers this might not be interesting, as users pay for the allocated resources; but for academic and more specialized compute centers, which support a specific community and have to take care of their resources, it might be worth the effort to investigate the scaling behavior of the tools and applications in use.

Hardware vendors themselves would also benefit directly from a benchmark suite. They would not have to work out how to install the tools, how to run them, or which datasets are suitable for them. This could lead to systems specialized for specific workloads, especially for bioinformatics.

From the view of application developers, it would be a real help to test a tool with regard to the utilized resources and its scaling behavior, to see where the limits or bottlenecks are and possibly to overcome them. For example, it would not make sense to increase the throughput of an application or speed up the underlying algorithm if the bottleneck is the read/write speed of the disk. But to reveal a limit like this, a supporting tool that does all the measurements would be helpful.
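As a sketch of the kind of analysis such a supporting tool enables, speedup and parallel efficiency can be derived from wall-clock times recorded at different thread counts; a flat efficiency curve at high thread counts points to a bottleneck outside the CPU. The timing values below are hypothetical, purely for illustration:

```python
def scaling_metrics(runtimes):
    """Compute speedup and parallel efficiency relative to the single-threaded run.

    runtimes: mapping of thread count -> measured wall-clock seconds.
    """
    base = runtimes[1]  # single-threaded baseline
    return {
        n: {"speedup": base / t, "efficiency": base / (t * n)}
        for n, t in runtimes.items()
    }

# Hypothetical measurements of one tool at 1, 2, 4 and 8 threads
measured = {1: 400.0, 2: 210.0, 4: 120.0, 8: 90.0}
metrics = scaling_metrics(measured)
# Efficiency dropping well below 1.0 as threads increase suggests the tool
# no longer benefits from more cores, e.g. because disk I/O dominates.
```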

Related work

Most of these use cases can be handled by providing a benchmark suite focusing on bioinformatics applications. As only a few published works can be found in the literature [1], [11], there seems to be a lack of such benchmark suites, especially regarding multithreaded applications. With BioPerf, Bader et al. presented a benchmark suite to evaluate high-performance computer architectures on bioinformatics applications. The benchmark suite includes ten different bioinformatics applications

The BOOTABLE benchmark suite

We initially developed BOOTABLE to allow resource providers to explore and evaluate their current hardware with regard to bioinformatics workloads. During the implementation phase we extended it to make it also usable for application developers, to get an overview of, or even deep insights into, the resource consumption and scaling behavior of their developed tools. The subsequent sections explain how BOOTABLE is used and which implemented features are available.

Tools and datasets

The bioinformatics tools, applications, and packages used for BOOTABLE were handpicked, based on the usage behavior of our cloud users in the de.NBI cloud [5], to generate a workload as close as possible to real-world examples. The applications are not only distributed over different areas in bioinformatics; they also cover broader areas of the life sciences like molecular dynamics and machine learning.

The same holds for the datasets: like the applications, they were carefully chosen. First we assure

Discussion tools and datasets

As previously presented in detail, all the tools used and their algorithms have their own behavior regarding their complexity in time and space. Almost all of the tools depend on the input size, whether it is the number of sequences or the size of a molecule. Only for GROMACS and the CIFAR-10 model can the runtime be regulated by the number of steps, which makes them more independent of their input regarding time complexity, but not regarding their memory usage. All other tools strongly

Conclusions

Bioinformatics applications represent a non-negligible computing workload. In order to build a suitable infrastructure and design computationally efficient applications, a well-defined benchmark environment with a broad spectrum of bioinformatics applications is clearly required. This paper presents a benchmark tool suite with popular applications from the bioinformatics topics of sequence assembly, sequence alignment, molecular dynamics and machine learning. Suppliers of computational resources

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors acknowledge support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 37/935-1 FUGG. Part of the work presented here was also supported through BMBF funded project de.NBI (031 A 534A) and MWK Baden-Württemberg funded project CiTAR (“Zitierbare wissenschaftliche Methoden”).

Maximilian Hanussek

• Bachelor of Science (B.Sc.) in Bioinformatics at the University of Tübingen, Germany (2014).

• Master of Science (M.Sc.) in Bioinformatics at the University of Tübingen, Germany (2017).

• Since April 2017 working on my Ph.D. at the University of Tübingen, Germany.

• Since April 2017 responsible for the de.NBI cloud site Tübingen at the computing center of the University of Tübingen (Zentrum für Datenverarbeitung, ZDV), Germany.

  • Current working fields are cloud computing combined with high performance storage systems, cloudification of bioinformatic applications and bioinformatic workflows, mass spectrometry, development of virtual clusters and science gateways on demand.

References (53)

  • David A. Bader, Yue Li, Tao Li, BioPerf: A benchmark suite to evaluate high-performance computer architecture on...
  • Amazon, Amazon Elastic Compute Cloud (Amazon EC2)
  • Google, Google cloud computing, hosting services & APIs
  • Microsoft, Microsoft Azure cloud computing platform; services
  • Andreas Tauch et al., Bioinformatics in Germany: Toward a national-level infrastructure, Brief. Bioinform. (2019)
  • Janne Chr. Schulz, Überlegungen zur Steuerung einer föderativen Infrastruktur am Beispiel von bwCloud
  • Fernando H. Barreiro Megino et al., Helix Nebula and CERN: A symbiotic approach to exploiting commercial clouds, J. Phys. Conf. Ser. (2014)
  • Mickael Goujon et al., A new bioinformatics analysis tools framework at EMBL-EBI, Nucleic Acids Res. (2010)
  • Adam Godzik, Fold recognition methods, Methods Biochem. Anal. (2003)
  • Martin Karplus et al., Molecular dynamics simulations of biomolecules, Nature Struct. Mol. Biol. (2002)
  • Kursad Albayraktaroglu, Aamer Jaleel, BioBench: A benchmark suite of bioinformatics applications, in: ISPASS 2005 -...
  • Stephen F. Altschul et al., Basic local alignment search tool, J. Mol. Biol. (1990)
  • Cédric Notredame et al., T-Coffee: A novel method for fast and accurate multiple sequence alignment, J. Mol. Biol. (2000)
  • Julie D. Thompson et al., CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. (1994)
  • William R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA, Methods Enzymol. (1990)
  • Michael Larabel, Matthew Tippett, Phoronix Test Suite, Accessed 22 July 2019,...
  • Zeki Bozkus, Basilio B. Fraguela, A portable high-productivity approach to program heterogeneous systems, in:...
  • Maximilian Hanussek, Maximilianhanussek/BOOTABLE: Initial release (2019)
  • Peggy Wu et al., ANSIBLE (2017)
  • Charles Anderson, Docker, IEEE Softw. (2015)
  • Gregory M. Kurtzer et al., Singularity: Scientific containers for mobility of compute, PLoS One (2017)
  • A. El Maguiri, OpenStack, Proc. Inst. Civil Eng. Waste Resour. Manage. (2016)
  • Nigel Griffiths, Nmon performance: A free tool to analyze AIX and Linux performance (2003)
  • Smxi, Inxi, Accessed 11 January 2019,...
  • Peter Amstutz et al., Common Workflow Language, vol. 1.0
  • J. Craig Venter et al., The sequence of the human genome, Science (2001)