Using ontologies for preprocessing and mining spectra data on the Grid

doi:10.1016/j.future.2006.04.011

Future Generation Computer Systems

Volume 23, Issue 1, 1 January 2007, Pages 55-60

https://doi.org/10.1016/j.future.2006.04.011

Abstract

The analysis of mass spectrometry proteomics data requires the composition of different software tools devoted to the loading, management, preprocessing, mining, and visualization of spectra data. This paper proposes the use of ontologies to guide the composition of preprocessing and data mining tools and describes the approach through MS-Analyzer, a software tool for the integrated management, preprocessing and mining of spectra data on the Grid.

Introduction

Mass Spectrometry (MS) proteomics is a powerful technique for identifying different molecular targets in different pathological conditions [1]. The data produced by MS, the spectra, may be represented as a very large set of (intensity, $m/z$) measures, each giving the abundance (intensity) of biomolecules with a certain mass-to-charge ratio ($m/z$). Such data can be used for various analyses, such as biomarker discovery (i.e. finding a list of peaks characterizing a disease), peptide/protein identification, and sample classification. Since spectra have a high dimensionality and are often affected by errors and noise, preprocessing techniques are required before any data analysis, and especially before Data Mining (DM).
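The spectrum representation and the role of preprocessing described above can be illustrated with a minimal Python sketch. It is not taken from MS-Analyzer: the two steps shown (total-ion-current normalization and a naive local-maximum peak picker), the function names, and the sample values are assumptions chosen only for illustration, and each point is stored as an ($m/z$, intensity) pair so that the list can be kept sorted by $m/z$.

```python
from typing import List, Tuple

# One spectrum: (m/z, intensity) pairs, assumed sorted by m/z.
Spectrum = List[Tuple[float, float]]


def normalize_tic(spectrum: Spectrum) -> Spectrum:
    """Scale intensities so that they sum to 1 (total ion current)."""
    total = sum(intensity for _, intensity in spectrum)
    if total == 0:
        return list(spectrum)  # nothing to scale
    return [(mz, intensity / total) for mz, intensity in spectrum]


def pick_peaks(spectrum: Spectrum, min_intensity: float) -> Spectrum:
    """Keep strict local maxima whose intensity is at least min_intensity."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        _, left = spectrum[i - 1]
        mz, centre = spectrum[i]
        _, right = spectrum[i + 1]
        if centre > left and centre > right and centre >= min_intensity:
            peaks.append((mz, centre))
    return peaks


# Made-up sample values, for illustration only.
raw = [(1000.0, 2.0), (1000.5, 8.0), (1001.0, 3.0),
       (1001.5, 1.0), (1002.0, 5.0), (1002.5, 1.0)]
normalized = normalize_tic(raw)
peaks = pick_peaks(normalized, min_intensity=0.2)
```

Real preprocessing pipelines typically also include steps such as baseline subtraction, smoothing, and alignment across spectra; the sketch only shows why a reduced list of peaks is a more tractable input for data mining than the full set of measures.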

The increasing use of MS in clinical studies leads to the collection of spectra data from large sample populations, e.g. to monitor the progression of a disease. Moreover, the comparative study of a disease may require the analysis of spectra produced in different laboratories, so it is possible to envision that, in a few years, biomedical researchers will need to collect and analyse more and more data coming from remote proteomics laboratories. Grid technology can be useful for providing efficient storage space for keeping large spectra datasets on-line, the broadband infrastructure needed to collect proteomics data from remote laboratories securely and efficiently, and the computational power needed by the preprocessing and data mining algorithms.

On the other hand, building such a Virtual Proteomics Laboratory involves different technological platforms related to the various steps of a proteomics experiment, such as sample treatments, MS techniques, spectra processing, data mining analysis, and results visualization. Choosing the right tools requires multidisciplinary knowledge, from MS specialists to biologists and computer scientists; thus, modelling the semantics of processes, tools, and data sources is a key issue for simplifying the design of applications.

This paper proposes the combined use of workflow techniques and domain ontologies as a semantic guide for experiment formulation, tool exploitation, and application design, i.e. service composition. The proposed WekaOntology and ProtOntology ontologies describe concepts and relationships of data mining, bioinformatics software tools, and data sources. The use of such ontologies to compose applications, and the ability to combine preprocessing and data mining tools in different ways, are described through a case study developed with MS-Analyzer.
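How an ontology can act as a semantic guide for composing tools can be sketched with a toy example in Python. The sketch does not reproduce WekaOntology or ProtOntology: the tool names, the data concepts, and the `Tool`, `candidates`, and `validate` helpers are all hypothetical. It only assumes that each tool is annotated with the data concept it consumes and the one it produces, so that a designer can be offered only compatible next steps and an ill-formed workflow can be rejected before execution.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tool:
    name: str
    consumes: str  # data concept the tool accepts
    produces: str  # data concept the tool emits


# Hypothetical toy ontology: tools and concepts are illustrative only.
TOOLS: Dict[str, Tool] = {
    t.name: t
    for t in [
        Tool("LoadSpectra", "RawFile", "RawSpectra"),
        Tool("Normalize", "RawSpectra", "PreprocessedSpectra"),
        Tool("ToFeatureTable", "PreprocessedSpectra", "FeatureTable"),
        Tool("Classify", "FeatureTable", "Model"),
    ]
}


def candidates(data_concept: str) -> List[str]:
    """Names of the tools that can consume the given data concept."""
    return sorted(t.name for t in TOOLS.values() if t.consumes == data_concept)


def validate(workflow: List[str], start: str) -> str:
    """Check that each tool accepts what the previous step produced.

    Returns the data concept produced by the last tool, or raises
    ValueError at the first incompatible step.
    """
    current = start
    for name in workflow:
        tool = TOOLS[name]
        if tool.consumes != current:
            raise ValueError(f"{name} expects {tool.consumes}, got {current}")
        current = tool.produces
    return current


result = validate(["LoadSpectra", "Normalize", "ToFeatureTable", "Classify"], "RawFile")
```

In this toy setting, `candidates("RawSpectra")` offers only `Normalize`, while a workflow that applies `Classify` directly to raw spectra is rejected with an explanatory error.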

MS-Analyzer is a Grid-based Problem Solving Environment [5] that uses a Service Oriented Architecture, i.e. it offers spectra analysis services built as a composition of specialized spectra management and data mining services. Functions are thus implemented as Web/Grid Services, i.e. independent, self-describing programs that interact using the following standards: SOAP (Simple Object Access Protocol) for application invocation, WSDL (Web Services Description Language) for service description, and UDDI (Universal Description, Discovery and Integration) for service publishing and discovery. Using such an approach, MS-Analyzer is able to offer its spectra analysis services to multiple users, who can submit their analysis tasks concurrently.
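To make the role of SOAP in service invocation more concrete, the following Python sketch builds a SOAP 1.1 request envelope using only the standard library. It is an illustration under stated assumptions, not the MS-Analyzer interface: the service namespace `urn:example:spectra-service`, the operation name `preprocessSpectra`, and its parameters are hypothetical; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:spectra-service"             # hypothetical service namespace


def build_request(operation: str, params: dict) -> bytes:
    """Serialize a SOAP 1.1 envelope whose body holds one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for key, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{key}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and parameters, for illustration only.
request = build_request(
    "preprocessSpectra",
    {"datasetId": "demo-001", "method": "normalize"},
)
```

In a real deployment the operation names and message types would be read from the service's WSDL description, and the message would be sent to an endpoint located through the registry, rather than being written by hand as here.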

Section snippets

Related works

In the last few years, many systems dealing with ontologies and workflows, as well as spectra management, have been developed but, to the best of our knowledge, none of these systems implements a complete knowledge discovery process for the analysis of mass spectra and the extraction of biological information. Systems like SpecAlign [15], MSAnalyzer (http://sashimi.sourceforge.net/; please note the name similarity with MS-Analyzer, the system proposed here), and those developed in [7], are all

Mass spectrometry data analysis

Mass spectrometry [1] is a technique that is used to identify macromolecules such as proteins or peptides in a compound. The mass spectrometer separates gas-phase ions according to their $m / z$ (mass-to-charge ratio) values. Commonly used MS techniques are SELDI (Surface Enhanced Laser Desorption/Ionization) and MALDI (Matrix-Assisted Laser Desorption/Ionization) TOF (Time Of Flight), and Liquid-Chromatography tandem mass spectrometry (LC-MS/MS).

Although each instrument usually produces data in a

Ontologies and mass spectrometry analysis

Designing a data mining application for the analysis of MS data involves different experts and requires the simultaneous use of different kinds of knowledge, including: (i) basic concepts of MS, related to the content and the format of spectra generated by different MS techniques; (ii) concepts of data management, related to the efficient organization, retrieval, and preprocessing of spectra; (iii) concepts of knowledge discovery, related to the different available data mining algorithms, to

Ontology-based application design using MS-Analyzer

MS-Analyzer is a software platform for bioinformatics applications that allows the integrated preprocessing, management and data mining analysis of MS proteomics data. MS-Analyzer provides various services that implement spectra management and preprocessing, as well as data mining and visualization functions; the latter have been obtained by wrapping Weka tools [14]. In particular, management, preprocessing and preparation services, respectively, implement the format conversion and efficient

References (16)

M. Bubak et al., Workflow composer and service registry for grid applications, Future Gener. Comput. Syst. (2005)
M. Cannataro et al., Distributed data mining on the grid, Future Gener. Comput. Syst. (2002)
J. Cunha et al., Future trends in distributed applications and problem-solving environments, Future Gener. Comput. Syst. (2005)
R. Aebersold et al., Mass spectrometry-based proteomics, Nature (2003)
M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri, Preprocessing of mass spectrometry proteomics data on the...
N. Goodman et al., The labbase system for data management in large scale biology research laboratories, Bioinformatics (1998)
N. Jeffries, Algorithms for alignment of mass spectrometry proteomic data, Bioinformatics (2005)
A. Laud, S. Bhowmick, P. Cruz, D. Singh, G. Rajesh, The grna: A highly programmable infrastructure for prototyping,...

There are more references available in the full text version of this article.

Mario Cannataro has been an Associate Professor of Computer Science at the University Magna Græcia of Catanzaro, Italy, since November 2002. He received his Laurea Degree cum Laude in Computer Engineering from the University of Calabria, Rende, Italy, in 1993. He has worked on parallel computing, massively parallel architectures, parallel implementation of logic programs and cellular automata. His current research interests are in grid computing, and grid-based problem-solving environments for bioinformatics, proteomics, ontologies, and adaptive hypermedia systems. He has authored a book on the parallel implementation of logic programs and over 100 scientific papers published in international journals and conference proceedings. He is a member of ACM, ACM Sigmod, and IEEE Computer Society, and he serves as a program committee member for several international conferences. He is a co-founder and a member of Exeura (www.exeura.com). Mario Cannataro can be reached at [email protected].

Pietro Hiram Guzzi received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. He has been pursuing his Ph.D. in Informatics and Biomedical Engineering at the University Magna Græcia of Catanzaro, Italy, since November 2004. His research interests comprise bioinformatics, the analysis of proteomics data, and the distributed management of ontologies. Pietro H. Guzzi can be reached at [email protected].

Tommaso Mazza received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. He has been pursuing his Ph.D. in Informatics and Biomedical Engineering at the University Magna Græcia of Catanzaro, Italy, since November 2004. His research interests comprise web/grid services, workflow technology, and statistical and data mining algorithms for the analysis of mass spectra and biomedical data. Tommaso Mazza can be reached at [email protected].

Giuseppe Tradigo received his Laurea degree in Computer Engineering from the University of Calabria, Rende, Italy, in 2003. He is currently working with the bioinformatics group of the University Magna Græcia of Catanzaro. His research interests are in geographical databases and GIS, grid computing, software engineering, protein structure prediction, and grid environments for bioinformatics applications. Giuseppe Tradigo can be reached at [email protected].

Pierangelo Veltri is an Assistant Professor of Computer Science at the University Magna Græcia of Catanzaro, Italy. He received his Ph.D. in Computer Science in 2002 at INRIA Rocquencourt, France, and obtained his Laurea degree in Computer Engineering in 1998 at the University of Calabria, Italy. From 1998 to 2000 he was a member of the Verso group and participated in the Chorochronous project (1998–1999) and the Xyleme project (1999–2001). Since 2002 he has participated in several projects, whose main topics are computer science (such as Grid computing, geographical and temporal databases, and XML views) and bioinformatics (spectra data analysis for cancer detection, proteomics, and prediction of protein structures). His main research interests are XML and semi-structured databases, data modelling, spatial and multidimensional databases, protein structure matching, protein structure prediction, and data management in grid computing. Pierangelo Veltri can be reached at [email protected].
