Using ontologies for preprocessing and mining spectra data on the Grid

https://doi.org/10.1016/j.future.2006.04.011

Abstract

The analysis of mass spectrometry proteomics data requires the composition of different software tools devoted to the loading, management, preprocessing, mining, and visualization of spectra data. This paper proposes the use of ontologies to guide the composition of preprocessing and data mining tools and describes the approach through MS-Analyzer, a software tool for the integrated management, preprocessing and mining of spectra data on the Grid.

Introduction

Mass Spectrometry (MS) proteomics is a powerful technique for identifying different molecular targets in different pathological conditions [1]. Data produced by MS, the spectra, may be represented as a very large set of (intensity, m/z) measurements, each giving the abundance (intensity) of biomolecules with a certain mass-to-charge ratio (m/z). Such data can be used for various analyses, such as biomarker discovery (i.e. finding a list of peaks that characterizes a disease), peptide/protein identification, and sample classification. Since spectra have high dimensionality and are often affected by errors and noise, preprocessing techniques are required before any data analysis, and especially before Data Mining (DM).
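To make the data representation concrete, the following minimal Python sketch models a spectrum as a list of (m/z, intensity) pairs and applies two common preprocessing steps: fixed-width binning to reduce dimensionality, and total-ion-current normalization. The function names, the bin width, and the sample values are illustrative assumptions, not the preprocessing actually implemented in MS-Analyzer.

```python
from typing import Dict, List, Tuple

# A spectrum as a list of (m/z, intensity) pairs (illustrative representation).
Spectrum = List[Tuple[float, float]]


def bin_spectrum(spectrum: Spectrum, width: float) -> Spectrum:
    """Reduce dimensionality by summing intensities within fixed-width m/z bins."""
    bins: Dict[int, float] = {}
    for mz, intensity in spectrum:
        key = int(mz // width)
        bins[key] = bins.get(key, 0.0) + intensity
    # Report each bin by its lower m/z edge, in increasing m/z order.
    return [(key * width, total) for key, total in sorted(bins.items())]


def normalize_tic(spectrum: Spectrum) -> Spectrum:
    """Scale intensities so that they sum to 1 (total ion current normalization)."""
    total = sum(intensity for _, intensity in spectrum)
    if total == 0:
        return list(spectrum)
    return [(mz, intensity / total) for mz, intensity in spectrum]


# Hypothetical sample values, chosen only for illustration.
raw = [(1000.2, 5.0), (1000.7, 3.0), (1001.4, 2.0), (1003.9, 10.0)]
binned = bin_spectrum(raw, width=1.0)
normalized = normalize_tic(binned)
```

Binning before normalization keeps the normalized intensities comparable across spectra that were acquired with different sampling densities; other orderings and other techniques (baseline subtraction, smoothing, peak alignment) are equally plausible in a real pipeline.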

The increasing use of MS in clinical studies leads to the collection of spectra data from large sample populations, e.g. to monitor the progression of a disease. Moreover, the comparative study of a disease may require the analysis of spectra produced in different laboratories, so it is reasonable to expect that, in a few years, biomedical researchers will need to collect and analyse more and more data coming from remote proteomics laboratories. Grid technology can provide efficient storage space for keeping large spectra datasets on-line, the broadband infrastructure needed to collect proteomics data from remote laboratories securely and efficiently, and the computational power needed by the preprocessing and data mining algorithms.

On the other hand, building such a Virtual Proteomics Laboratory involves different technological platforms related to the various steps of a proteomics experiment, such as sample treatments, MS techniques, spectra processing, data mining analysis, and results visualization. Choosing the right tools requires multidisciplinary knowledge, from MS specialists to biologists and computer scientists; modelling the semantics of processes, tools, and data sources is therefore a key issue for simplifying the design of applications.

This paper proposes the combined use of workflow techniques and domain ontologies as a semantic guide for experiment formulation, tool exploitation, and application design, i.e. service composition. The proposed WekaOntology and ProtOntology ontologies describe the concepts and relationships of data mining, bioinformatics software tools, and data sources. The use of such ontologies to compose applications, and the ability to combine preprocessing and data mining tools in different ways, are described through a case study developed with MS-Analyzer.
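As a rough illustration of how an ontology can guide tool composition, the sketch below encodes a toy concept hierarchy and a few tool descriptions, then checks whether the output of each workflow step is semantically compatible with the input of the next. All concept names, tool names, and relationships here are hypothetical; they are not the actual content of WekaOntology or ProtOntology, and the compatibility rule is only one plausible reading of ontology-guided composition.

```python
from typing import Dict, List, Optional

# Hypothetical "is-a" hierarchy of data concepts (child -> parent).
IS_A: Dict[str, str] = {
    "RawSpectrum": "Spectrum",
    "PreprocessedSpectrum": "Spectrum",
    "Spectrum": "Data",
    "FeatureTable": "Data",
    "Model": "Data",
}

# Hypothetical tool descriptions: the concept each tool consumes and produces.
TOOLS: Dict[str, Dict[str, str]] = {
    "BaselineSubtraction": {"input": "RawSpectrum", "output": "PreprocessedSpectrum"},
    "Binning": {"input": "Spectrum", "output": "FeatureTable"},
    "Classifier": {"input": "FeatureTable", "output": "Model"},
}


def is_subtype(concept: str, ancestor: str) -> bool:
    """Return True if `concept` equals `ancestor` or descends from it via IS_A."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)
    return False


def can_follow(first: str, second: str) -> bool:
    """A tool may follow another if it accepts what the previous tool produces."""
    return is_subtype(TOOLS[first]["output"], TOOLS[second]["input"])


def validate_workflow(steps: List[str]) -> bool:
    """Check every consecutive pair of tools in the workflow for compatibility."""
    return all(can_follow(a, b) for a, b in zip(steps, steps[1:]))


valid = validate_workflow(["BaselineSubtraction", "Binning", "Classifier"])
invalid = validate_workflow(["BaselineSubtraction", "Classifier"])
```

In this toy setting the first workflow is accepted because a preprocessed spectrum is a kind of spectrum, which the binning tool accepts, while the second is rejected because a classifier expects a feature table rather than a spectrum.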

MS-Analyzer is a Grid-based Problem Solving Environment [5] that uses a Service Oriented Architecture, i.e. it offers spectra analysis services built as a composition of specialized spectra management and data mining services. Functions are thus implemented as Web/Grid Services, i.e. independent, self-describing programs that interact using the following standards: SOAP (Simple Object Access Protocol) for application invocation, WSDL (Web Services Description Language) for service description, and UDDI (Universal Description, Discovery and Integration) for service publishing and discovery. Using such an approach, MS-Analyzer is able to offer its spectra analysis services to multiple users, who can submit their analysis tasks concurrently.

Section snippets

Related works

In the last few years, many systems dealing with ontologies and workflows, as well as spectra management, have been developed but, to the best of our knowledge, none of these systems implement a complete knowledge discovery process for the analysis of mass spectra and biological information extraction. Systems like SpecAlign [15], MSAnalyzer (http://sashimi.sourceforge.net/, please note the name similarity with MS-Analyzer, the system proposed here), and those developed in [7], are all

Mass spectrometry data analysis

Mass spectrometry [1] is a technique that is used to identify macromolecules such as proteins or peptides in a compound. The mass spectrometer separates gas-phase ions according to their m/z (mass-to-charge ratio) values. Commonly used MS techniques are SELDI (Surface Enhanced Laser Desorption/Ionization) and MALDI (Matrix-Assisted Laser Desorption/Ionization) TOF (Time Of Flight), and Liquid-Chromatography tandem mass spectrometry (LC-MS/MS).

Although each instrument usually produces data in a

Ontologies and mass spectrometry analysis

Designing a data mining application for the analysis of MS data involves different experts and requires the simultaneous use of different kinds of knowledge, including: (i) basic concepts of MS, related to the content and the format of spectra generated by different MS techniques; (ii) concepts of data management, related to the efficient organization, retrieval, and preprocessing of spectra; (iii) concepts of knowledge discovery, related to the different available data mining algorithms, to

Ontology-based application design using MS-Analyzer

MS-Analyzer is a software platform for bioinformatics applications that allows the integrated preprocessing, management and data mining analysis of MS proteomics data. MS-Analyzer provides various services that implement spectra management and preprocessing, as well as data mining and visualization functions; the latter have been obtained by wrapping Weka tools [14]. In particular, management, preprocessing and preparation services, respectively, implement the format conversion and efficient

Mario Cannataro has been an Associate Professor of Computer Science at the University Magna Græcia of Catanzaro, Italy, since November 2002. He received his Laurea Degree cum Laude in Computer Engineering from the University of Calabria, Rende, Italy, in 1993. He has worked on parallel computing, massively parallel architectures, parallel implementation of logic programs and cellular automata. His current research interests are in grid computing, and grid-based problem-solving environments for bioinformatics, proteomics, ontologies, and adaptive hypermedia systems. He has authored a book on the parallel implementation of logic programs and over 100 scientific papers published in international journals and conference proceedings. He is a member of ACM, ACM Sigmod, and IEEE Computer Society, and he serves as a program committee member for several international conferences. He is a co-founder and a member of Exeura (www.exeura.com).

Pietro Hiram Guzzi received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. He has been pursuing his Ph.D. in Informatics and Biomedical Engineering at the University Magna Græcia of Catanzaro, Italy, since November 2004. His research interests comprise bioinformatics, the analysis of proteomics data, and the distributed management of ontologies.

Tommaso Mazza received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. He has been pursuing his Ph.D. in Informatics and Biomedical Engineering at the University Magna Græcia of Catanzaro, Italy, since November 2004. His research interests comprise web/grid services, workflow technology, and statistical and data mining algorithms for the analysis of mass spectra and biomedical data.

Giuseppe Tradigo received his Laurea degree in Computer Engineering from the University of Calabria, Rende, Italy, in 2003. Currently he is working with the bioinformatics group of University “Magna Græcia” of Catanzaro. His research interests are in geographical databases and GIS, grid computing, software engineering, protein structure prediction, and grid environments for bioinformatics applications.

Pierangelo Veltri is an Assistant Professor of Computer Science at the University Magna Græcia of Catanzaro, Italy. He received his Ph.D. in Computer Science in 2002, at INRIA Rocquencourt, France, and obtained his Laurea degree in Computer Engineering in 1998 at University of Calabria, Italy. From 1998 to 2000 he was a member of the Verso group and participated in the Chorochronous project (1998–1999) and the Xyleme project (1999–2001). Since 2002 he has participated in several projects, whose main topics are computer science (such as Grid computing, geographical and temporal databases, and XML views) and bioinformatics (spectra data analysis for cancer detection, proteomics, and prediction of protein structures). His main research interests are: XML and semi-structured databases, data modelling, spatial and multidimensional databases, protein structure matching, protein structure prediction, and data management on grid computing.
