1 Introduction

In the past decade, the wide availability of array-based and DNA sequencing technologies has enabled the generation of large quantities of diverse omic data (Tursz et al. 2011; Volinia et al. 2010; Nicoloso et al. 2010). The analysis and interpretation of these diverse data have pushed for a change in research modalities and towards interdisciplinary research, which is enhancing our understanding of cancer biology in health and disease. In recent years, advanced semi-interactive data analysis algorithms, such as those from the field of data mining, have gained increasing importance in the life sciences in general and in genetics and medicine in particular. Recent literature includes many examples of the application of data mining in these fields (Almansoori et al. 2012; Zhou et al. 2012; Kl and Tan 2012). Although many steps have yet to be made, there is today a trend towards extensively collecting data from different sources in repositories potentially useful for subsequent analysis. At the time the data are collected, they are analysed in a specific context which influences the experimental design; however, the type of analyses that the data will be used for after they have been deposited is not known. Content and data format are tailored only to the first experiment, not to future re-use. Thus, complex process chains are needed for the preparation and collection of relevant data for future analyses, and such process chains need to be supported by the environments that are used to set up analysis solutions. Building specialized software is not a solution, as this effort can only be carried out for huge projects running for several years. Hence, data mining functionality has been packaged into tool-kits, which provide it in the form of collections of different components. Depending on the research questions of the users, solutions consist of distinct compositions of these components.

Today, existing solutions for data mining processes comprise different components that represent different steps in the analysis process, and there exist graphical or script-based tool-kits for combining such components. Classical data mining tools, which can serve as components in analysis processes, are based on single-computer environments, local data sources and single users. However, analysis scenarios in medical and bioinformatics have to deal with multi-computer environments, distributed data sources and multiple users that have to cooperate. Users need support for integrating data mining into analysis processes in the context of such scenarios, and this support is lacking today. Typically, analysts working with single-computer environments face the problem of large data volumes, since their tools address neither scalability nor access to distributed data sources. Distributed environments provide scalability and access to distributed data sources, but the integration of existing components into such environments is complex. In addition, new components often cannot be developed directly in distributed environments. Moreover, in scenarios involving multiple computers, multiple distributed data sources and multiple users, reuse of components and analysis processes becomes more important, as more steps and configuration, and thus much bigger efforts, are needed to develop and set up a solution.

In this paper, we introduce the field of scientific data analysis in bio- and medical informatics and present some of today’s typical analysis scenarios in bioinformatics, including the roles of the user groups involved, the data sources that are available and the analysis processes that are set up. Subsequently, we present the challenges for data analysis processes in today’s health information systems in the context of personalized medicine. Based on these challenges and requirements, we present the building blocks that can serve as a basis for the development of the data mining environment in a system for personalized medicine. Then, we go into detail on the approach of data mining process patterns for reuse. Finally, we present a case study of a bioinformatics scenario. This paper extends (Wegener et al. 2011) by a detailed description of the data mining process pattern approach, including a case study, and contains content from Wegener (2012).

2 Scientific data analysis in bio- and medical informatics

Bioinformatics is conceptualizing biology in terms of macromolecules and applying information technology techniques from applied maths, computer science and statistics to understand and organize the information associated with these macromolecules (Luscombe et al. 2001). Typical research questions in bioinformatics are, e.g., finding predictive or prognostic biomarkers, defining subtypes of diseases, classifying samples by using gene signals, annotations, etc. In order to answer such questions, bioinformaticians, statisticians, medics and biologists combine different heterogeneous data sources from private or public repositories, and they apply, or if needed develop and then apply, different analysis methods to the information extracted from the repositories and interpret the results until they have found good combinations of data sources and analysis methods. This process can be short or long, straightforward or complex, depending on the nature of the data and questions. This is what we will call here a scenario.

In the following, we will describe some of the data sources and repositories, techniques and analysis processes and user groups that are typically involved in bioinformatics scenarios.

2.1 Data sources

Bioinformatics is an area in which analysis scenarios include huge amounts and different types of data. Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences and the results of functional genomics experiments such as expression data (Luscombe et al. 2001) and others like those available at http://www.cancergenome.nih.gov.

Recent advances in technology enable collecting data at more and more detailed levels (Roos 2001): organism, organ, tissue up to cellular and even sub-cellular level (Luscombe et al. 2001; Soinov 2006). In detail, this includes the following:

  • Organism level: an organism is the biological system in its wholeness, typically including a group of organs. Organism level related data is the clinical data, which usually comes from the hospital database manager.

  • Organ level: an organ is a group of tissues that together perform a complex function. Organ level related data usually comes from the pathologist.

  • Tissue level: tissues are groups of similar cells specialized for a single function. Tissue level data usually come from the pathologist.

  • Cellular level: in a multi-cellular organism such as a human, different types of cells perform different tasks. Cellular level related data are usually organized by the lab manager.

  • Sub-cellular level: data at the sub-cellular level concern the structures that compose the cells. Usually, a data analyst can retrieve these data by performing ontology analyses.

Additional information includes the content of scientific papers and "relationship data" from metabolic pathways, taxonomy trees and protein–protein interaction networks (Luscombe et al. 2001). Comprehensive meta-data describing the semantics of the heterogeneous and complex data are needed to leverage it for further research (Weiler et al. 2007). To address this issue, efforts exist for describing the data in a comprehensive way by domain-specific ontologies (Brochhausen et al. 2011).

Data from different sources (partially publicly available) are extensively collected in repositories potentially useful for subsequent analysis (Luscombe et al. 2001; Roos 2001; Soinov 2006). Common public repositories include, e.g., the following:

2.2 Techniques and user groups

Bioinformatics employs a wide range of techniques from maths, computer science and statistics, including sequence alignment, database design, data mining, prediction of protein structure and function, gene finding and expression data clustering, which are applied to heterogeneous data sources (Luscombe et al. 2001). Bioinformatics is a collaborative discipline (Roos 2001). Bioinformaticians of today are highly qualified and specialized people from various backgrounds such as data mining, mathematics, statistics, biology, IT development, etc., and a typical analysis scenario involves multiple users and experts from different departments or organizations. Bioinformaticians often work together with different collaborators; very schematically, these can be the following:

  • IT people: they might support bioinformaticians by providing and helping with the needed computational power, network infrastructure and data sharing.

  • Clinicians: they are often a key point for patient’s information access and for the design and planning of the clinical part of the experiment.

  • Pharmaceuticals Companies: they might be interested in discoveries that have a commercial potential, typically at the end of the research project.

  • Statisticians: they can provide help on designing the study and correctly analysing the data.

  • Biologists: they can provide help on designing the experiment and correctly interpret the data. They can also be key people for managing the clinical samples.

2.3 Analysis processes

The common procedure for data analysis for scenarios from bioinformatics can be described in an abstract way as follows:

  • The experiment is designed together with the collaborators involved, and the methods and data needed are identified.

  • Based on the research question, data of different types are acquired from data repositories.

  • Based on the research question, the methods are gathered.

  • If method and data are readily available the process can start; otherwise, more collection/development/implementation is needed and the process is temporarily halted until new data or tools are available.

  • Each type of data is pre-processed.

  • The data are merged.

  • The analysis is performed.

  • The results are discussed with the collaborators.

  • The whole process can be iterated if new hypotheses are generated.
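
As an illustration, the steps above can be sketched as a small pipeline skeleton. This is only a schematic sketch in Python; all step implementations (pre-processing as scaling, analysis as per-type means) are hypothetical placeholders, not part of any specific tool:

```python
# Schematic sketch of the abstract analysis procedure; every step
# implementation is a hypothetical placeholder.

def preprocess(dataset):
    """Type-specific pre-processing (placeholder: scale values to sum 1)."""
    total = sum(dataset["values"])
    return {**dataset, "values": [v / total for v in dataset["values"]]}

def merge(datasets):
    """Merge the pre-processed datasets into one structure (placeholder)."""
    return {ds["type"]: ds["values"] for ds in datasets}

def analyse(merged):
    """The analysis itself (placeholder: per-type mean)."""
    return {t: sum(vs) / len(vs) for t, vs in merged.items()}

def run_iteration(raw_datasets):
    """One pass of the common procedure: pre-process each dataset,
    merge them, analyse; experiment design, discussion and iteration
    with the collaborators happen outside this sketch."""
    return analyse(merge([preprocess(ds) for ds in raw_datasets]))

results = run_iteration([
    {"type": "expression", "values": [2.0, 4.0, 2.0]},
    {"type": "clinical", "values": [1.0, 1.0]},
])
print(results)
```

The iteration of the whole process then simply corresponds to calling such a pipeline again with new data or new placeholder steps.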

Figure 1 visualizes such a common analysis process from an abstract point of view. Analysis processes involve both manual and automated steps. The results of the analysis processes have to be interpreted in order to use them, e.g., to support the clinical decision-making process (Rossi et al. 2010; Planche et al. 2011).

Fig. 1

Common procedure in scenarios from bioinformatics

When composing a solution to an analysis problem, bioinformaticians mainly work together with biologists and clinicians to provide the best possible solution to the project questions. The solution is typically composed of different components. Many of them are recycled from previous solutions and often need to be adapted. Other components will be designed and developed from scratch.

Implementation of bioinformatics scenarios is typically done with tools chosen as most appropriate by the bioinformaticians. Due to the various backgrounds, there is a quite heterogeneous set of tools and languages in use. Thus, analysis processes can be very different, depending on the type of data, the technology used, the tools used, the aim of the study, etc. But in the biomedical field, some steps are common: quality control, normalization, filtering, visualization and finding differentially expressed entities.

3 Challenges and requirements

Today’s data analysis scenarios in bioinformatics face the following challenges:

Heterogeneous group of users in different locations

In today’s bioinformatics scenarios, users working at different locations have to collaborate; as evidence, one can easily check the affiliations of the authors of a paper in PubMed (http://www.ncbi.nlm.nih.gov/pubmed/). Bioinformaticians of today come from various backgrounds such as data mining, mathematics, statistics, biology, IT development, etc. Thus, the scenarios involve a heterogeneous and distributed group of users. Depending on their background, knowledge and type of job, users interact with an analysis environment in different ways and use different tools. For instance, some bioinformaticians might want to configure and run predefined workflows via simple form-based web pages. Other users might want to design new workflows based on existing components or reuse workflows from colleagues. Yet others might want to develop new components by writing their analysis algorithms in their own language of choice, or use software from colleagues and integrate it into the system by writing a plug-in module for the code to run within the environment. Advanced users might even want to partially modify the structure of the workflow environment itself. When multiple users with different backgrounds work together at different locations, the set of tools used is also quite heterogeneous. However, the users typically have neither an overview of the full system nor detailed knowledge about all of its parts. This is especially true if they are involved in huge projects, such as the developers and curators of UCSC (http://genome.ucsc.edu/), Ensembl (http://www.ensembl.org/index.html) or TCGA (http://cancergenome.nih.gov/).

Large, heterogeneous and distributed data sources

Today, data are still mainly collected and evaluated with focus on a specific problem or study, but they are more and more often re-used for further studies in the wider bioinformatics and healthcare domain. Thus, data are extensively collected from different sources in repositories potentially useful for subsequent analysis. As the type of analysis is not yet known at the time the data are collected in these repositories, content and data format meet some basic requirements but are not focused on a particular use. Moreover, recent advances in technology allow for collecting data at more detailed levels, so the volume of data can become very large. In analysis scenarios in the context of bioinformatics, several different data and data types are involved. People with different responsibilities and analysis questions work with different sets of data sources. The corresponding data sources are distributed by nature: there exist a large number of public data sources and repositories that are accessible via the internet, and private data sources are distributed across several departments of a hospital or institute, or even across different hospitals or institutes. As a result, a huge amount of distributed data is available for usage. For these reasons, the scenarios involve an increasing number of data sources and amount of data. Typically, bioinformatics scenarios include the development of a solution based on a certain restricted data repository and its evaluation on publicly available data, or vice versa. The semantics of the datasets are complex and need to be described to allow proper usage. Due to the heterogeneity and complexity of the data, several domain-specific ontologies exist for the description of the semantics of the data by comprehensive meta-data.

Multi computer environments

Today’s analysis scenarios have to deal with distributed and heterogeneous users as well as distributed and heterogeneous data sources. Instead of single-computer environments or environments hosted inside a certain organization, the scenarios involve users working with different tools and with distributed data sources managed in different systems spread over the globe. In addition, today’s data analysis applications in bioinformatics increase in complexity and in their demand for resources. To address this issue, solutions can be integrated into distributed environments that provide computing resources and allow for scalability, as, for example, in the analysis of deep sequencing data (Hawkins et al. 2010).

Complex process chains

Content and data format of the data collected in the areas of medicine and bioinformatics are not focused on a fixed problem or research question, but continuously change to adapt to new needs. Thus, complex process chains are needed for the analysis of the data. Building specialized software for each analysis problem tends to be the current solution, but this is not ideal, as such an effort can only be carried out for huge projects running for several years. Instead, such process chains need to be supported by the environments that are used to set up analysis solutions.

4 p-medicine approach

In this section, we will introduce our approach towards the data mining environment in the p-medicine system that aims at addressing the requirements presented above. First, we will present some lessons learned from prior projects. Second, we will show the building blocks that are foreseen for the data mining environment.

4.1 Lessons learned from prior projects

In Rüping et al. (2010) and Bucur et al. (2011) we presented some lessons learned from building a data mining environment in the ACGT project (Advancing Clinico Genomic Trials on Cancer, http://eu-acgt.org), which had the goal of implementing a secure, semantically enhanced end-to-end system in support of large multi-centric clinico-genomic trials. The various elements of the data mining environment can be integrated into complex analysis pipelines through the ACGT workflow editor and enactor. In the following, we will summarize the lessons learned.

As the construction of a good data mining process requires encoding a significant amount of domain knowledge, it cannot be fully automated. By reusing and adapting existing processes that have proven successful, we hope to save much of this manual work in a new application and thereby increase the efficiency of setting up data mining workflows. While a multitude of tools for data mining, bioinformatics and statistics on clinical data exists, the questions of quality control and standardization, as well as ease of use and reusability, remain largely unanswered.

With respect to workflow reuse, we had the following experience in setting up and running an initial version of the ACGT environment:

  • The construction of data mining workflows is an inherently complex problem when it is based on input data with complex semantics, as is the case for clinical and genomic data.

  • Because of the complex data dependencies, copy and paste is not an appropriate technique for workflow reuse.

  • Standardization and reuse of approaches and algorithms work very well on the level of services, but not on the level of workflows. While it is relatively easy to select the right parametrization of a service, making the right connections and changes to a workflow template produced by a third party quickly becomes quite complex, such that users often find it easier to construct a new workflow from scratch.

  • Workflow reuse only occurs when the initial creator of a workflow describes its internal logic in great detail. However, most creators avoid this effort because the large-scale re-use of workflows is a relatively new and unregulated activity, because tools that facilitate workflow annotation are still lacking and because the annotation requires human resources that are not always available.

In order to be able to meaningfully reuse data mining workflows, a flexible but formal notation is needed that allows expressing both technical information about the implementation of workflows and high-level semantic information about the purpose and pre-requisites of a workflow.

In summary, the situation of having a large repository of workflows to choose the appropriate one from, which is often assumed in existing approaches for workflow recommendation systems, may not be very realistic in practice.

4.2 Building blocks for the data mining environment

We identified a set of building blocks that can serve as basis for the p-medicine data mining environment:

  • Reusing available components: a method for the integration and reuse of data mining components that have been developed in a single computer environment into distributed environments.

  • Developing new components: a method for interactive development of data mining components in distributed environments.

  • Reusing existing analysis processes: a method for the integration and reuse of data mining based analysis processes that involve several analysis steps.

  • GUI and system interfaces: interfaces that address different levels of granularity for users to work with the system or to extend the system.

In the following, we will describe these building blocks in more detail.

4.2.1 Reusing available data mining components

To support users in using standard data mining components and other available components with small effort, there is a need for an approach to integrate data mining components developed for a single-processor environment into a distributed environment. We assume that a comprehensive solution for the data mining problem does not yet exist, but that it can be obtained by using and correctly composing available data mining components.

In the DataMiningGrid (http://www.datamininggrid.org/) and ACGT projects, approaches and infrastructure principles for the integration of data mining components into distributed environments have been contributed (Bucur et al. 2011; Stankovski et al. 2008). In Stankovski et al. (2008), a meta-data schema definition (Application Description Schema, ADS), which is used to grid-enable existing data mining components, was presented as a solution. The ADS is used to manage user interaction with system components, to register and search for available data mining components on the grid, to match analysis jobs with suitable computational resources and to dynamically create user interfaces. The approach allows for integration by users without deeper knowledge of the underlying distributed systems and without any intervention on the application side, and thus addresses the community’s need to support users in using standard data mining tools and available components.
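
To illustrate the descriptor idea, the following sketch models an ADS-style component description together with a resource-matching check. The real Application Description Schema is an XML schema; the field names and the matching rule below are simplified assumptions for illustration only:

```python
# Simplified sketch of an ADS-style component descriptor; the actual
# Application Description Schema is an XML schema, and these field
# names are illustrative, not the real schema elements.
from dataclasses import dataclass, field

@dataclass
class ApplicationDescription:
    name: str
    version: str
    executable: str                                   # command run on a grid node
    inputs: dict = field(default_factory=dict)        # parameter name -> type
    outputs: dict = field(default_factory=dict)
    requirements: dict = field(default_factory=dict)  # e.g. memory, CPUs

def matches(app: ApplicationDescription, resource: dict) -> bool:
    """Match an analysis job to a computational resource by checking that
    the resource offers at least what the descriptor requires."""
    return all(resource.get(k, 0) >= v for k, v in app.requirements.items())

clusterer = ApplicationDescription(
    name="kmeans-clustering", version="1.0", executable="kmeans",
    inputs={"data": "CSV", "k": "int"}, outputs={"labels": "CSV"},
    requirements={"memory_gb": 4, "cpus": 2},
)
print(matches(clusterer, {"memory_gb": 8, "cpus": 4}))  # -> True
print(matches(clusterer, {"memory_gb": 2, "cpus": 4}))  # -> False
```

A registry of such descriptors is what allows the environment to search for components and to generate user interfaces from the declared inputs.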

The GridR service (Bucur et al. 2011) allows for reusing R script-based data mining components. The underlying method of GridR reduces the complexity of integrating and handling analysis scripts in distributed environments: instead of registering each single application as a separate component in the environment, the method is technically based on a single service with complex inputs and outputs that accepts the algorithm itself as a parameter.
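
The single-service principle can be illustrated as follows. This Python sketch only mimics the idea of shipping the analysis script as a parameter of one generic service; GridR itself executes R scripts in a grid environment, and all names here are hypothetical:

```python
# Sketch of the single-service principle behind GridR: one generic
# service takes the analysis script itself as a parameter instead of
# registering each analysis as a separate component. Python's exec is
# used here only as a stand-in for shipping the script to an R engine.

def run_script_service(script: str, inputs: dict) -> dict:
    """Generic service: execute a script against named inputs and
    return the variables the script defined as its outputs."""
    env = dict(inputs)
    exec(script, {}, env)
    return {k: v for k, v in env.items() if k not in inputs}

# The 'component' travels as data, not as a separately installed service:
outputs = run_script_service("mean_x = sum(x) / len(x)", {"x": [1.0, 2.0, 3.0]})
print(outputs)  # -> {'mean_x': 2.0}
```

The point of the design is that a single registered service suffices for an open-ended set of analyses, since the analysis logic is passed at call time.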

4.2.2 Developing new data mining components

In addition to reusing components from single-computer environments, users like bioinformaticians and biostatisticians typically need to interactively develop data mining components in the analysis environment, in order to combine information from different data sources and to apply different methodologies to the information extracted from these repositories.

In the ACGT project, a method for interactive development based on novel infrastructure principles that allow for profiting from the functionality and support of standardized tools and environments was contributed (Bucur et al. 2011). The approach supports the development of data mining solutions by integrating data mining scripts into complex analysis systems and their processes. The GridR toolkit (Wegener et al. 2009) is based on this approach for interactively developing scripts consisting of data mining components in distributed environments. In addition to providing a single service as interface for the execution of data mining scripts, the method allows for interactively developing data mining scripts in eScience environments, based on extensions to the R environment that interface with the API of middleware components of distributed systems. The approach efficiently supports users when it is necessary to enhance available components or to develop new ones: users can interactively develop data mining based data analysis processes directly within a distributed environment.

4.2.3 Reusing existing analysis processes by data mining process patterns

In today’s analysis solutions in bioinformatics, complex process chains have to be set up, and the composition of such process chains is a huge effort. Thus, reuse of processes becomes much more important. However, analysis processes often cannot be used directly, as they are customized to a certain analysis question, and the information on how the process was set up and which requirements have to be met for applying it is often not available. In Wegener and Rüping (2011) we contributed the concept of data mining process patterns, which facilitate the integration and reuse of data mining in analysis processes. The underlying approach is based on encoding requirements and pre-requisites inside the analysis process, and on a task hierarchy that allows for generalizing and concretizing tasks for the creation and application of process patterns. The data mining process pattern approach supports users in scenarios that cover different steps of the data mining process or involve several analysis steps. Data mining process patterns support the description of data mining processes at different levels of abstraction between the CRISP model (Shearer 2000) as the most general and executable workflows as the most concrete representation. Hence, they allow for easy reuse and integration of data mining processes.

4.2.4 GUI and system interfaces

Today’s environments for data mining in the context of bioinformatics scenarios have to support users in working with the system or extending it at different levels of granularity. One of the reasons why many current tools are not used by bioinformaticians is that they are black boxes, i.e., it is not easy to adapt the tools to new situations and requirements. There is a need for an open system which can be accessed in layers, depending on the wishes of the user.

In conclusion, from our experience a system is needed that allows IT specialists to modify it, allows mathematicians and statisticians to plug in their models easily, and allows biologists and clinicians not to see the analysis algorithms but still to understand what has been used and why, as they will need to report and justify it. Other fundamental features would be that, once a workflow is created, it can be used on data external to the specific repository, and that an interface can easily be created for each workflow and customized by the bioinformatician. This could be, for example, a web page containing links for the input and the output. Such interfaces are fundamental for clinicians to be able to use a workflow, but it is also fundamental that they can be customized, as techniques in genomics and molecular biology are continuously changing with new requirements.

5 Data mining process patterns

In the following, we give details on our approach to reusing available analysis processes by data mining process patterns. Our approach is based on the Cross Industry Standard Process for Data Mining (CRISP), a standard process model for data mining that depicts the phases of a project, their respective tasks and the relationships between these tasks (Shearer 2000). CRISP consists of the six phases Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment, and defines generic tasks for each of these phases.

In detail, our approach is based on the definition of data mining process patterns, which allow for the description of processes at different levels of abstraction, and on a meta-process for applying these process patterns to new problems (Wegener and Rüping 2011). Figure 2 gives an example of how the reuse of data mining processes is supported. A workflow created by user A solves a certain analysis problem. To enable the reuse of this process, user A creates a data mining process pattern from the workflow. This is done by abstracting tasks that are not directly reusable according to the task hierarchy and by modelling the assumptions and prerequisites. As user A owns the workflow, he is the only person who knows all its assumptions and details; thus, he is the right person to perform the abstraction. Other users do not have this detailed knowledge, so it is harder for them to collect the correct assumptions and to abstract the tasks. User B, who wants to reuse the solution of user A, takes the pattern, checks the prerequisites and assumptions and, if they are fulfilled, creates a workflow by specializing the abstract tasks according to his specific needs. If the assumptions and prerequisites are not fulfilled, he cannot directly use the pattern, and further steps for abstracting tasks are needed.
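
The reuse procedure can be sketched as follows. The pattern representation and all names are illustrative assumptions; actual patterns are modelled graphically in BPMN, but the control flow, checking prerequisites and then binding abstract tasks, is the one described above:

```python
# Sketch of the reuse procedure: user A publishes a pattern (prerequisites
# plus tasks, some of them abstract); user B checks the prerequisites and
# specializes the abstract tasks. All names are illustrative; actual
# patterns are modelled graphically in BPMN.

pattern = {  # created by user A from a working workflow
    "prerequisites": {"data_has_missing_values": False},
    "tasks": [
        {"name": "normalize", "level": "executable", "component": "quantile_norm"},
        {"name": "cluster", "level": "conceptual",
         "description": "choose a clustering method suited to the data"},
    ],
}

def specialize(pattern, context, bindings):
    """User B's side: verify the prerequisites, then bind every
    non-executable task to a concrete component."""
    for key, required in pattern["prerequisites"].items():
        if context.get(key) != required:
            raise ValueError(f"prerequisite not fulfilled: {key}")
    return [task["component"] if task["level"] == "executable"
            else bindings[task["name"]]          # manual specialization
            for task in pattern["tasks"]]

workflow = specialize(pattern,
                      context={"data_has_missing_values": False},
                      bindings={"cluster": "kmeans_k3"})
print(workflow)  # -> ['quantile_norm', 'kmeans_k3']
```

If a prerequisite is not fulfilled, the sketch raises an error, mirroring the case in which user B cannot directly use the pattern and further abstraction is needed.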

Fig. 2

Procedure of reuse with data mining process patterns

For the visualization of process patterns, we use the Business Process Model and Notation (BPMN), which is a graphical representation for specifying business processes in a business process model (White and Miers 2008). The graphical notation for specifying processes is based on a flowcharting technique. Figure 3 visualizes which tasks of CRISP are mapped to the process pattern level.

Fig. 3

The CRISP tasks that form the basis for data mining process patterns

In order to allow for a description of processes that support reuse at the level of the general CRISP model, of executable workflows and of abstractions in-between, we define the following different levels of granularity for tasks:

  • Executable level: A task is described at the executable level if there exists a description of the task that allows it to be executed automatically. Tasks at the executable level consist of a mapping to an existing component and a set of already specified inputs. The inputs can be directly defined in the configuration of the component, provided by the results of previous tasks or provided as inputs for the overall process. A task at the executable level is called an executable task.

  • Configurable level: A task is described at the configurable level if there exists a description of the task that specifies a mapping to an existing component and a set of configurable inputs that are needed by the component. Such tasks have to be processed manually by specifying the missing inputs. A task at the configurable level is called a configurable task.

  • Structural level: A task is described at the structural level if there exists a description of the task in the form of a graph G = (V, E), comprising a set V of sub-tasks together with a set E of directed edges (ordered pairs of sub-tasks from V), and a textual description of how to further specialize the sub-task(s). A task at the structural level is called a structural task.

  • Conceptual level: A task is described at the conceptual level if there only exists a textual description of the task. A task at the conceptual level is called a conceptual task.
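
The four levels can be sketched as a small class hierarchy. The level names come from the definitions above; the fields, methods and example data are illustrative assumptions only:

```python
# Sketch of the four task levels as a class hierarchy; the level names
# come from the definitions above, everything else is illustrative.

class ConceptualTask:
    """Only a textual description; processed manually by the user."""
    def __init__(self, description):
        self.description = description

class StructuralTask(ConceptualTask):
    """Pre-structured by a graph of sub-tasks and directed edges."""
    def __init__(self, description, sub_tasks, edges):
        super().__init__(description)
        self.sub_tasks, self.edges = sub_tasks, edges

class ConfigurableTask(ConceptualTask):
    """Bound to a component, but some inputs are still unspecified."""
    def __init__(self, description, component, open_inputs):
        super().__init__(description)
        self.component, self.open_inputs = component, open_inputs

class ExecutableTask(ConfigurableTask):
    """All inputs specified; can be executed automatically."""
    def __init__(self, description, component, inputs):
        super().__init__(description, component, open_inputs=[])
        self.inputs = inputs
    def run(self):
        return self.component(**self.inputs)

# A hypothetical 'Clean Data' task at the two extreme levels:
manual = ConceptualTask("inspect the records and remove bad ones by hand")
automatic = ExecutableTask(
    "drop records with missing values",
    component=lambda records: [r for r in records if None not in r],
    inputs={"records": [(1, 2), (3, None), (4, 5)]},
)
print(automatic.run())  # -> [(1, 2), (4, 5)]
```

Modelling the executable level as a specialization of the configurable level mirrors the hierarchy: an executable task is a configurable task whose open inputs have all been specified.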

Figure 4 describes how we visualize the different levels. Conceptual tasks consist of a textual description that includes information on how to further specialize the task. If, e.g., a task of a process is not reusable, it needs to be replaced such that the process becomes reusable; thus, there might be a need to develop a new atomic data mining component or a data mining script in order to be able to reuse the process. The description of such tasks refers to the conceptual level. In addition, prerequisites for the process and manual tasks, e.g., a task for checking whether the plots resulting from an analysis are satisfactory, can be described by conceptual tasks. The user needs to process such conceptual tasks manually.

Fig. 4 Visualization of the task levels

Structural tasks consist of a partially formalized description that pre-structures the task by a graph of sub-tasks and gives information on how to further specialize the task. Tasks for organizing components, for developing or adapting a workflow, or for developing scripts from existing components are described at the structural level. For instance, a data preprocessing task that consists of the two steps normalization and filtering, for which the components are not yet specified, is a structural task. Structural tasks have to be processed manually by the user.

Configurable tasks are already bound to an existing component but cannot be executed, as further input is needed. Tasks for the parametrization of existing components, scripts, or workflows are configurable tasks. For instance, a data fusion task that needs the identifiers of the records of the two tables to be joined as parameters is a configurable task.

Executable tasks can be executed directly, which means that they can be used to describe the tasks of an executable workflow. Tasks for executing components, scripts, and workflows are described at the executable level. No user interaction is needed for processing these tasks. For instance, a task for executing an analysis that is fully specified by an R script is an executable task.

Thus, the task levels represent a hierarchy of tasks, in which executable tasks are described at the most detailed level and conceptual tasks at the most general level. For instance, a Clean Data task could be specified as a human task (conceptual task), as a component that deletes records with missing values (executable task) or replaces them with a user-defined value (configurable task), or as a separate data mining process for the prediction of missing values (structural task).

Tasks of the task hierarchy can be specialized to tasks at lower levels by processing their textual description, by creating sub-tasks, by creating a new component or selecting an existing one, or by specifying the inputs of the tasks.

A data mining process pattern is a directed graph G = (V, E) comprising a set V of vertices together with a set E of directed edges, which are ordered pairs of elements of V. The vertices are the CRISP tasks as described in Fig. 3. Every specialization of this process pattern for an application is also a data mining process pattern.

An executable data mining process pattern is a pattern whose tasks are specified at the executable level. An executable data mining process pattern contains enough information to transform it into an executable process in a process environment. Further graphical elements from BPMN can be used to define process patterns in more detail.
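
The two definitions above can be sketched as two small checks: a pattern is a directed graph over a set of task vertices, and it is executable exactly when every task is specified at the executable level. The function and task names below are invented for illustration.

```python
def is_valid_pattern(vertices, edges):
    """A process pattern is a directed graph G = (V, E): every edge must be an
    ordered pair of distinct, known vertices."""
    vs = set(vertices)
    return all(a in vs and b in vs and a != b for a, b in edges)


def is_executable_pattern(task_levels, edges):
    """A pattern is executable if it is a valid graph and every task in it is
    specified at the executable level.

    task_levels: mapping from task name to its level ("conceptual",
    "structural", "configurable" or "executable")."""
    return is_valid_pattern(task_levels, edges) and all(
        level == "executable" for level in task_levels.values()
    )
```

A pattern containing, say, one configurable quality-control task would therefore not yet qualify as executable, even if its graph structure is valid.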

In the next Section, we present an example of how to create a process pattern from a given process and how to apply it for the creation of an executable workflow.

6 Case study—process pattern of a clinical trial scenario

In this Section we present a case study based on a clinical trial scenario. We will show how to create a process pattern from a data mining script by extracting configurable tasks and how to specialize the process pattern to create a workflow.

6.1 The clinical trial scenario

Genomic data are often analysed in the context of a retrospective clinical study; in a so far rare but increasing number of cases, this can be a clinical trial or an intervention study. The analysis of this class of data involves a variety of data ranging from genomic, transcriptomic and proteomic data to imaging data. In addition, clinical and demographic data include attributes such as age, gender, tumour location, tumour diameter, tumour grade, tumour stage, histology, pathology attributes, nodal invasion, etc. The exact attributes of the clinical data vary depending on the study or trial and the specific disease.

One scenario of the p-medicine project describes a statistical analysis of tumour samples with associated gene expression data and clinical features. This analysis is a semi-standardized procedure which is usually performed by statisticians or bioinformaticians using several ad-hoc tools. In the scenario, cancer samples are analysed to identify gene signatures that may indicate whether a tumour is malignant or not, or whether the tumour will metastasise or not.

The aim of this scenario is to provide evidence that could assist in clinical decisions in the future. The patient is the focus, and patient data are dealt with specifically. Although there is no mechanism to feed the results back to individual patients, the results will increase the knowledge about the disease and in the long term contribute to new and better treatment solutions. The scenario has the following inputs and outputs:

  • Input: cancer probes for ’Uveal melanoma’ in the Affymetrix HG-U133 Plus 2 format (set of CEL files, a single file is named like GSM550624.cel). Each file represents one tumour related to one patient and is anonymous. The files can be retrieved from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE22138/ or http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22138/. A manuscript has been published based on this data (Laurent et al. 2011).

  • Input: Clinical data as CSV tables (e.g., named like GSE22138_clin.csv). These can be used to attach personal data to the anonymous CEL files. The following clinical and personal properties are available: tissue, age, sex, eye (right, left), tumour location, tumour diameter (mm), tumour thickness (mm), tumour cell type, retinal detachment, extrascleral extension, chromosome 3 status, months to the end point of metastasis.

  • Output: Heatmaps, survival statistics, Kaplan-Meier plots and tests.

The scenario includes a manual preparatory part. The input data need to be read, normalized and controlled. After the data are saved into a ’local workspace’, they need to be manually examined by an expert. There are several reasons why the microarray data have to be normalized; among them are the following:

  • The array is divided into regions, and several of them can have a systematically higher (or lower) intensity and background compared with others.

  • Scratches on the glass.

  • Presence of dirty zones.

  • Undetectable spots (intensity lower than the background).

Quality control plots are used to detect discrepancies between samples, e.g., whether the samples belong to two or more different batches (batch effect). A degradation plot, an intensity-density plot and a box plot are used to check whether one or more samples (outliers) behave differently from the others (the main group). In particular, the box plot should be produced both before and after normalization. The normalization process usually allows minimizing both random and batch effects. After the data are verified as ’clear’, the analysis process can start. The first step is to find the differentially expressed genes between classes of samples. The result can then be visualized by a volcano plot or by a heatmap (see Fig. 5).

Fig. 5 Examples of the types of plots and results produced by the scenario. From the upper left panel: intensity/density plot, survival analysis (Kaplan-Meier plot, metastasis-free survival), heatmap (showing the regulation of the differentially expressed genes), volcano plot (showing genes that are differentially expressed between the categories ’detach’ and ’non-detach’)

The heatmap, for example, is a method to visualize omic data and helps to identify patterns of activity or expression with respect to clinical groups, for example patients with or without metastatic disease. The heatmap visualization and interpretation is followed by a survival analysis in which Cox regression is used, a risk index is generated, and patient subgroups are visualized using the Kaplan-Meier plot, which shows the survival probability of subsets of patients, for example those predicted as being at high risk and those predicted as being at low risk for a malignancy. The Kaplan-Meier plot is complemented by tests, for example the log-rank test for variables categorized into two groups, serving as the basis for providing the hazard ratio between groups and its confidence limits. Figure 5 shows some examples of the types of plots and results produced by the scenario.
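
The Kaplan-Meier plot mentioned above is based on the product-limit estimator S(t) = ∏ (1 − d_i/n_i) over the event times t_i ≤ t, where d_i is the number of events and n_i the number of patients still at risk at t_i. As a rough illustration of that computation (not the scenario's R implementation), a minimal pure-Python version could look like this:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times:  follow-up time per patient
    events: 1 if the event (e.g. metastasis) was observed, 0 if censored
    Returns the step function as a list of (event_time, survival_probability).
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    at_risk, s, curve = n, 1.0, []
    i = 0
    while i < n:
        t = times[order[i]]
        deaths = removed = 0
        # Group all patients with the same follow-up time.
        while i < n and times[order[i]] == t:
            deaths += events[order[i]]
            removed += 1
            i += 1
        if deaths:  # censored times shrink the risk set but add no step
            s *= 1 - deaths / at_risk
            curve.append((t, s))
        at_risk -= removed
    return curve
```

For four patients with times [1, 2, 3, 4] and events [1, 0, 1, 1], the estimate drops to 0.75 at t = 1, skips the censored patient at t = 2, and falls to 0.375 and then 0.0 at the later event times.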

The scenario has been implemented as R scripts in the context of the p-medicine project. In detail, the scenario consists of six parts which could be considered as individual components:

  1. Prepare Experiment—import and normalization of genomic data.

  2. Quality Control—generation of plots for quality control before and after normalization.

  3. Build Environment Structure—generation of the data structure including clinical data for the analysis.

  4. Find Differentially Expressed Genes—finding the differentially expressed genes and generation of the volcano plot and heatmaps.

  5. Create Risk Index—generation of a risk index for the survival analysis.

  6. Survival Analysis—generation of Kaplan-Meier plots for the survival analysis based on the risk index.

6.2 Process pattern of the scenario

The first step of creating the process pattern is to identify the individual components which have to be described by tasks in the process pattern. In our case, we split the scenario into 9 R scripts. The code that covers the import and the normalization of the genomic data is represented by the Prepare Experiment group including the tasks ReadExperimentData and NormalizeData. The code that deals with the generation of the 3 plots for checking the data quality is represented by the Quality Control group including the tasks QC Degradation, QC Intensities and QC LogIntensity vs. Density. The code that covers the import of the clinical data and the creation of the data structure for the analysis is represented by the BuildEnvironmentStructure task. The code for finding the differentially expressed genes including the creation of the volcano plot and the heatmaps, the risk index and the survival analysis are also represented by the respective tasks: FindDifferentiallyExpressedGenes, CreateRiskIndex and SurvivalAnalysis. Figure 6 visualizes the identification of the components in the script.

Fig. 6 Identification of components and groups of components in the script

The code of the individual components is not directly reusable, as it is part of a stand-alone R script. The components, which should be reusable as individual R scripts, have to be abstracted to the configurable level to allow for reuse. This can be achieved by adding headers and footers to the R scripts. The headers are responsible for loading the R libraries needed by each of the split scripts, which was done once at the beginning of the original script. Furthermore, the headers and footers take care of the data exchange between the split scripts by storing and loading the data of the R workspace. In addition to the headers and footers, it is necessary to specify parameters for the directories from which the input data can be read and to which the output data have to be stored in order to allow for reuse. By this, the components are transformed into configurable tasks. The parameters that have to be configurable are essentially the folders in which the input data for the individual components reside and into which the results should be written. The only part that remains at the conceptual level is the decision on the quality control results, as this has to be performed manually anyway; in the original scenario, this was done by the bioinformatician. Figure 7 visualizes the process pattern of the scenario in BPMN.
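
The header/footer mechanism can be sketched as a small generator that wraps a code fragment cut out of the original script: the header loads the libraries and restores the shared R workspace, the footer persists it for the next component, and the two directory parameters make the component configurable. All names here (parameter names, file names, command-line conventions) are illustrative, not the project's actual conventions.

```python
def wrap_component(body, libraries,
                   input_dir_param="input.dir", output_dir_param="output.dir"):
    """Wrap a split-out R code fragment with a header and footer so it can be
    run as a stand-alone, reusable component (illustrative sketch)."""
    header_lines = [f"library({lib})" for lib in libraries]  # once per split script
    header_lines += [
        # the two configurable parameters: where to read from, where to write to
        f"{input_dir_param} <- commandArgs(trailingOnly = TRUE)[1]",
        f"{output_dir_param} <- commandArgs(trailingOnly = TRUE)[2]",
        # data exchange: restore the workspace left by the previous component
        f'load(file.path({input_dir_param}, "workspace.RData"))',
    ]
    # data exchange: persist the workspace for the next component
    footer = f'save.image(file.path({output_dir_param}, "workspace.RData"))'
    return "\n".join(header_lines) + "\n\n" + body + "\n\n" + footer
```

Applied to each of the split scripts, this turns a fragment of the monolithic analysis into a configurable task whose only open inputs are the two directory parameters.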

Fig. 7 The process pattern of the scenario visualized in BPMN. The components have been transformed into process pattern tasks

6.3 Taverna implementation

The process pattern created in the previous Section can be specialized to create an executable workflow. In our case study, the scenario is implemented as a workflow in Taverna (Hull et al. 2006). In detail, the overall workflow consists of 6 nested workflows that are connected to each other. The nested workflows represent the tasks and groups of tasks of the process pattern. The complete workflow is depicted in Fig. 8 as an overview and in Figs. 9, 10 and 11 in more detail. The R scripts representing the components are attached via the R-plugin of Taverna, which allows for the execution of R scripts within a Taverna workflow. These are visualized in dark blue in the figures.

Fig. 8 Scenario implemented in Taverna (overview). The tasks and task groups of the process pattern are deployed into nested workflows and workflow tasks in Taverna. Further details can be seen in Fig. 9 (Part 1), Fig. 10 (Part 2) and Fig. 11 (Part 3)

Fig. 9 Taverna screenshot (Part 1—including the nested workflows Prepare Experiment and Quality Control)

Fig. 10 Taverna screenshot (Part 2—including the conceptual task for checking the quality control results and the nested workflows Build Environment Structure and Create Cluster Diagram)

Fig. 11 Taverna screenshot (Part 3—including the nested workflows Create Risk Index and Survival Analysis)

The process starts with the first nested workflow, Prepare Experiment (see Fig. 9). It has two input parameters that are passed from the workflow input fields: the path to the input data and the path to which the output is written. The latter is passed to all other nested workflows and R tasks in the workflow, thus turning the tasks into executable tasks. Inside the nested workflow, two R scripts are executed: ReadExperimentData and NormalizeData. In the R script ReadExperimentData, the datasets, which are based on Affymetrix arrays, are read in and imported into variables. The data are accessible under a path that has to be specified as an input parameter. In the NormalizeData script, the data are normalized. The result returned from the nested workflow is the path where the results are stored. The pink tasks in the workflow are fields containing further information on the execution and completion of the R scripts. The nested workflow Quality Control consists of 3 R scripts, which check the quality of the imported and normalized data (see Fig. 9). The results are plots that have to be interpreted manually by the user.

After the completion of the quality control step, the user is asked via an input field in the UI whether the workflow is to be continued or not. This represents the conceptual task of the process pattern (see Fig. 10).

If the data quality was evaluated as sufficient, the next nested workflow that is executed is BuildEnvironment (see Fig. 10). The clinical data are imported and the data structure for the analysis is created. After that, the nested workflow ClusterDiagram is executed (see Fig. 10). It continues by finding the differentially expressed genes between established sub-groups of samples belonging to classes of interest (FindDifferentiallyExpressed) and produces cluster diagrams and heatmaps that provide information about which genes have an increased activity. Subsequently, the nested workflow RiskIndex is executed (see Fig. 11). It creates a risk index that is used for the survival analysis. Finally, the nested workflow SurvivalAnalysis is executed (see Fig. 11), in which Kaplan-Meier plots are created based on clinical features and the risk index.
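
The control flow just described (nested workflows executed in sequence, sharing one output directory, with a manual gate after quality control) can be sketched as follows. This is an illustrative sketch of the orchestration logic only; the names and signatures are invented and do not reflect Taverna's actual API.

```python
def run_pipeline(steps, output_dir, quality_ok):
    """Execute the scenario's steps in order, threading one shared output
    directory through them.

    steps: list of (name, callable, behind_gate) triples; each callable takes
    the output directory and returns a result path. quality_ok is a callable
    standing in for the manual decision on the quality control plots.
    """
    results = {"quality_control_passed": True}
    for name, step, behind_gate in steps:
        if behind_gate and not quality_ok():
            # The user judged the QC plots insufficient: stop the workflow.
            results["quality_control_passed"] = False
            break
        results[name] = step(output_dir)
    return results
```

With the gate callable returning False, only the preparatory and quality-control steps run; with True, the analysis steps down to the survival analysis are executed as well.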

The output of the workflow is the directory containing the outputs of the R tasks and the indication on whether the quality control was successful. The workflow furthermore includes a clean-up task that removes intermediate results from the output directory.

7 Conclusion

In this paper, we presented our approach towards developing a data mining environment for personalized medicine. The approach aims at addressing the needs and requirements for applying data mining techniques to bioinformatics solutions in the context of the p-medicine project.

Challenges for today’s bioinformatics scenarios are the heterogeneous set of users in different locations, the large, distributed and heterogeneous data sources, multi-computer environments, and complex process chains for the analysis. Our approach to addressing these challenges consists of 4 building blocks for the data mining environment:

  • a method for reusing data mining components created in single computer environments,

  • a method for developing new data mining components in distributed environments,

  • a pattern-based approach for reusing analysis processes including data mining components, and

  • GUI and system interfaces that allow users to work with the system or to extend the system at different levels of granularity.

In detail, heterogeneous data will be addressed by extensibility mechanisms and support for ontologies; heterogeneous users will be supported by website-like and expert interfaces as well as by the ability to reuse existing components and processes; and complex process chains will be addressed by the data mining process pattern approach.

We presented details on data mining process patterns and a case study on how to create and apply data mining process patterns in the context of a clinical trial scenario. It was shown that it is possible to create a process pattern by abstracting executable tasks of a script and to apply this pattern by specializing it into a workflow including a manual task. Defining a detailed system architecture for the system and implementing and testing it in the context of the p-medicine project are left for future work.