A high performance grid-web service framework for the identification of ‘conserved sequence tags’

https://doi.org/10.1016/j.future.2006.07.012

Abstract

The continuously growing demand for computing power in biological research limits what can be done on a single host and suggests an approach based on distributed computing. An emerging solution is grid technology, which allows organizations to make better use of existing computing resources by providing them with a single, transparent, aggregated source of computing power. Equally, bioinformatics analysis often involves many web services, which allow shared access to information and help the biologist to design, describe and record complex experiments. A new generation of grid infrastructure, in which web services are the building blocks, allows the management of web service workflows.

This work shows a tool for the identification and functional annotation of ‘Conserved Sequence Tags’ (CSTs) through cross-species genome comparisons, deployed on a Grid System Architecture, based on Web Services concepts and technologies.

Introduction

The explosive growth of biological data, stimulated by genome projects, has generated a parallel development of efficient computational approaches suitable for several biological research projects. In this area the need for High Performance Computing (HPC) is growing, though it usually cannot be met by the computational resources of a single research laboratory. Grid computing addresses this problem by coordinating and unifying several computational resources [1], allowing the evaluation and mining of large amounts of data in the terabyte and petabyte range. Unfortunately, present-day versions of grid middleware provide only a small part of the functionality required by the bioinformatics community. Web services, on the other hand, are a distributed computing technology that offers powerful capabilities for scalable interoperation of heterogeneous software across a wide variety of networked platforms [2]. They give the opportunity to create a framework in which applications distributed across local and wide area networks can interoperate through a set of standard protocols. The crucial difference from the past is that most earlier systems consisted of ad hoc solutions (e.g. CGI programs), whereas web services (WS) should lower the barrier to application integration. To increase individual and collective scientific productivity by making powerful information tools available to everyone, a service-oriented strategy is necessary. New projects on service-oriented grids [3] combine the assets of both grid and web service technology and help researchers to obtain high performance web services. Complex applications that exchange huge amounts of data through several web services have to be managed in order to obtain high performance and high availability systems, which encourages the convergence of grid and web services.

Among those classes of applications, to face the problem of identifying and assessing the coding or noncoding nature of conserved sequence tags (CSTs) through cross-species genome comparisons [4], [5], [6], we present a grid-web service framework, CSTgrid, whose core is implemented as web services. It is composed of one grid daemon module and seven web services, three for grid components and four for resource components. The CSTgrid web tool, available at http://www.caspur.it/CSTgrid, has been developed as an Open Grid Service Architecture, in which services act as the building blocks of the grid system, allowing the biology community to use all services without any knowledge of the underlying infrastructure. It can provide high performance and high availability, and can fairly handle hundreds of concurrent requests. The grid infrastructure has an ad hoc library, implemented as a set of web services, developed while the grid community is working on a standard toolkit for a service-oriented grid [3]. Furthermore, our grid-web service prototype, built to minimize the overhead of a standard grid toolkit (e.g. Globus toolkit [7]), is based on grid source components developed in compliance with Gtk standards, thus permitting an easy migration path to future grid service-oriented standards.
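As a rough illustration of how a single grid daemon might serve many concurrent requests in arrival order, the following minimal Python sketch queues jobs and lets one worker per resource pull from that queue. It is not the CSTgrid implementation: the class `GridDaemon`, the node names `node-a` and `node-b`, and the stand-in function `fake_cst_search` are all hypothetical, and the FIFO queue is only one possible reading of "fairly handle".

```python
import queue
import threading


class GridDaemon:
    """Hypothetical daemon: queues jobs and hands them to resource workers."""

    def __init__(self, resources):
        # resources: mapping of (assumed) node name -> callable that runs a job
        self._jobs = queue.Queue()      # FIFO, so jobs are taken in arrival order
        self._results = {}
        self._lock = threading.Lock()
        for name, run in resources.items():
            worker = threading.Thread(
                target=self._serve, args=(name, run), daemon=True
            )
            worker.start()

    def submit(self, job_id, payload):
        """Enqueue a job; returns immediately."""
        self._jobs.put((job_id, payload))

    def _serve(self, name, run):
        # Each resource worker repeatedly takes the oldest pending job.
        while True:
            job_id, payload = self._jobs.get()
            try:
                outcome = run(payload)
            except Exception as exc:    # a failing job must not stop the worker
                outcome = f"failed: {exc}"
            with self._lock:
                self._results[job_id] = (name, outcome)
            self._jobs.task_done()

    def wait(self):
        """Block until every submitted job has been processed."""
        self._jobs.join()
        with self._lock:
            return dict(self._results)


def fake_cst_search(payload):
    # Stand-in for a real CST search; it only reports the sequence length.
    return len(payload)


daemon = GridDaemon({"node-a": fake_cst_search, "node-b": fake_cst_search})
for i in range(200):
    daemon.submit(i, "ACGT" * (i + 1))
results = daemon.wait()
```

In this sketch a free resource always picks up the oldest waiting job, so no request is starved while others are served; a real deployment would also need persistence, timeouts and remote invocation, which are omitted here.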

Section snippets

The problem of the identification of conserved sequence tags (CSTs)

The annotation of sequence features in genome tracts is a fundamental task in genome analysis. Although the complete genomes of several eukaryotic organisms have been sequenced, we are not yet able to detect their complete gene inventory, including their regulatory elements [8], [9], [10]. The identification and assessment of the coding or noncoding nature of conserved sequence tags (CSTs) through cross-species genome comparisons may contribute significantly to functional annotation of whole

The software architecture of CSTgrid

The system is developed in multi-layered components to allow a Rapid Application Development (RAD) infrastructure and minimal administration effort. CSTgrid is logically composed of three tiers (Fig. 1):

(1) An interface tier responsible for communicating with end-user agents such as web browsers and command line clients.

(2) A generic (not oriented to searching CSTs) grid tier composed of a grid daemon responsible for the management of the grid resources.

(3) A resource tier composed of a set of
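The separation between the three tiers listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Job`, `GridTier`, `handle_request` and `cst_search`, as well as the form fields, are hypothetical and do not come from the CSTgrid code. The point it shows is that the grid tier stays generic, knowing services only by name, while CST-specific logic lives in the resource tier.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Job:
    """Generic job description passed from the interface tier to the grid tier."""
    service: str
    arguments: dict


def cst_search(arguments: dict) -> dict:
    # Resource tier: placeholder only, not the real CST search algorithm.
    return {
        "query": arguments["query"],
        "target": arguments["target"],
        "status": "done",
    }


class GridTier:
    """Grid tier: routes jobs to registered services without knowing what they do."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self._services[name] = handler

    def run(self, job: Job) -> dict:
        if job.service not in self._services:
            raise KeyError(f"unknown service: {job.service}")
        return self._services[job.service](job.arguments)


def handle_request(grid: GridTier, form: dict) -> dict:
    # Interface tier: turns an end-user request into a generic job.
    arguments = {key: value for key, value in form.items() if key != "service"}
    return grid.run(Job(service=form["service"], arguments=arguments))


grid = GridTier()
grid.register("cst_search", cst_search)
reply = handle_request(
    grid, {"service": "cst_search", "query": "human", "target": "mouse"}
)
```

Because the grid tier only sees a service name and an argument dictionary, other resource services could be registered without changing the interface or grid code.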

The grid enabled CSTminer

CSTminer is a web tool for the identification and characterization of genome tracts which are highly conserved across species during evolution. It is available at http://www.caspur.it/CSTminer. The tool makes use of local executables to perform CST searches and is dynamically interconnected to Ensembl genomes. The system was adequate for a few concurrent requests, but with many concurrent requests the server performance dropped. Furthermore, in case of a failure of some part of the

Benchmarks

In order to study how the code scales as the complexity of the run increases, we set up a computing environment with two DL385 (2x AMD Opteron @ 2.6 GHz, B1 and B2) and two DL585 (4x AMD Opteron @ 2.4 GHz, Q1 and Q2) interconnected via Cu–Gb ethernet (one op between B1–B2/Q1–Q2, 3 ops between B1–Q1/B2–Q2). Although this testbed system was implemented over a LAN, the particular configuration we set up for the above multiprocessor SMP machines has been able to clarify the performance of the CSTgrid

Conclusion

The CSTgrid architecture is highly modular, allowing an easier development and debugging process. The system has been developed as a Service-Oriented Architecture based on a collection of web services distributed over a geographical grid. It deploys an interface layer that is completely independent of the underlying grid layer. The system has been designed in a user-centric way, providing two points of access: the first one is for an end-user to perform high-performance CST searches; the second one is for the

Paolo D’Onorio De Meo graduated with a Bachelors degree in Computer Science from the University of Rome ‘La Sapienza’ in 2004. He has been a bioinformatics developer at the Caspur since 2004.

References (18)

  • D.-C. Li et al., Determination of the parameters in the dynamic weighted Round-Robin method for network load balancing, Computers and Operations Research (2005)
  • I. Foster et al., The anatomy of the grid: Enabling scalable virtual organizations, International Journal of Supercomputer Applications (2001)
  • E. Cerami, Web Services.... (2002)
  • I. Foster, C. Kesselman, J.M. Nick, S. Tuecke, The physiology of the Grid: An open Grid services architecture for...
  • F. Mignone et al., Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis, Nucleic Acids Research (2003)
  • T. Castrignano et al., CSTminer: A web tool for the identification of coding and noncoding conserved sequence tags through cross-species genome comparison, Nucleic Acids Research (2004)
  • T. Castrignano et al., GenoMiner: A tool for genome wide search of coding and noncoding conserved sequence tags, Bioinformatics (2005)
  • I. Foster et al., Globus: A metacomputing infrastructure toolkit, International Journal of Supercomputer Applications (1997)
  • M.A. Nobrega et al., Comparative genomic analysis as a tool for biological discovery, Journal of Physiology (2004)
There are more references available in the full text version of this article.

Danilo Carrabino graduated with a Bachelors degree in Computer Science from the University of Rome ‘La Sapienza’ in 2003. He has been a bioinformatics developer at the Caspur since 2005.

Nico Sanna is chief of the Computational Biology and Chemistry Group at Caspur (Consorzio per le Applicazioni di Supercalcolo per l’Università e Ricerca). He received his Master in Chemistry in 1990 from the University of Rome ‘La Sapienza’. His primary research interest is in the development of high performance computing applications.

Tiziana Castrignano is a bioinformatics specialist at Caspur (Consorzio per le Applicazioni di Supercalcolo per l’Università e Ricerca). She received her Ph.D. in Biophysics in 1999 from the University of Rome ‘La Sapienza’. Her primary research interest is in the development of high performance bioinformatics services.

Giorgio Grillo is a researcher at the Institute of Biomedical Technologies, CNR, Bari, Italy. He obtained a degree in Computer Science at the University of Bari and since 1994 he has been involved in Bioinformatics, particularly in the development of software for genomics and proteomics. He takes part in many national and international research projects, contributing his expertise in the analysis of biosequences and their functional characterization and in the linguistic and computational analysis of nucleotide sequences. Projects include the design, development and implementation of bioinformatic databases widely used by international researchers, and the management of query systems.

Flavio Licciulli obtained a degree in Computer Science at the University of Bari in 1992. Since 1994 he has been involved in Bioinformatics with the Bari BioInformatics Group. He is also a researcher at the C.N.R. Institute for Biomedical Technologies. His main research activity is the design, development, implementation and maintenance of biological databases, and the development of bioinformatics software and web interfaces; he has published over 20 scientific publications in these fields. He is responsible for the database and operating systems management at his institute.

Sabino Liuni is a researcher at the Institute of Biomedical Technology CNR-Bari, Italy. He obtained a degree in Biology in 1984 with a thesis on Bioinformatics, and has focused his interest on bioinformatics since then. In 1989 he was involved in the management and development of the Italian National EMBnet (European Molecular Biology Network) node. His current interests are mainly in the design and implementation of specialized databases of nucleotide sequences and in the development of algorithms for the analysis of biosequences. He has over 50 publications and is a co-author of widely used computational biology software.

Matteo Re is currently a doctoral student in the Department of Molecular Biosciences and Biotechnology at the University of Milan where he is working on methods for the detection and in-silico characterization of evolutionarily conserved sequences, novel genes and gene isoforms. He was awarded his degree (Molecular Biology, University of Milan) in 2004.

Flavio Mignone received his Degree in Biological Sciences from the University of Milan in 1998. Since 2000 he has worked with the Bioinformatics and Comparative Genomics group headed by Professor G. Pesole at the University of Milan. He received his PhD in Bioinformatics from the University of Milan in 2003; currently he is a Computer Science researcher at the same University. His main work areas include the study of mRNA untranslated regions (UTRs), the identification of regulatory elements and the identification of unannotated genes using computational approaches.

Graziano Pesole is full professor of Molecular Biology at the University of Bari. He has long carried out research in the fields of bioinformatics, comparative genomics and molecular evolution. In particular, his interests are computational approaches for the identification of regulatory elements in noncoding genome regions, alternative splicing and functional analysis of untranslated regions of eukaryotic mRNAs. He has developed a specialized database (UTRdb/UTRsite), widely used by the scientific community, collecting mRNA untranslated sequences and related regulatory motifs involved in the post-transcriptional regulation of gene expression. He has also developed analysis software and several algorithms widely used by the scientific community and available also through web browsers. Within his studies on molecular evolution, he has contributed to the development of new analysis methodologies and has carried out several studies on the evolution of the mitochondrial genome at the intraspecies level, in order to clarify some aspects of the origin of modern man, and at the interspecies level, to reconstruct mammal phylogeny and to study the evolutionary dynamics of the mitochondrial genome of Tunicata. He leads an interdisciplinary research group including molecular biologists, computer scientists and mathematicians. He has coordinated research units in several research projects funded by national (MIUR, CNR, Telethon, AIRC) and international (EU, NIH) agencies, and has filed an international patent for the selection of primers for RNA fingerprinting. He is a member of the editorial board of international journals (GENE, BMC Bioinformatics, BMC Genomics, Computational Biology and Chemistry, Briefings in Bioinformatics), author of over 120 papers published in international journals and co-author of books on Bioinformatics and Genomics published by Italian (Zanichelli, Gnocchi) and international (Wiley) publishers. For further information see http://www.pesolelab.it.

1 Present address: Dipartimento di Biochimica e Biologia Molecolare “E. Quagliariello”, Università di Bari, Italy.
