Unit of Work Supporting Generative Scientific Workflow Recommendation

Zhang, Jia; Pourreza, Maryam; Lee, Seungwon; Nemani, Ramakrishna; Lee, Tsengdar J.

doi:10.1007/978-3-030-03596-9_32


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11236)

Included in the following conference series:

International Conference on Service-Oriented Computing


Abstract

Service discovery and recommendation are playing an increasingly important role as more and more reusable web services are published onto the Internet. Existing methods typically recommend either individual services or multiple services without their interconnections. In contrast, this research aims to mine service usage history and extract units of work (UoWs), each comprising a collection of services chained together through intermediate components. A novel technique is proposed in this paper to study how services collaborated, or could collaborate, in the form of reusable UoWs to serve various workflows (i.e., mashups), based on an evolving service social network. Upon receiving a scientific workflow request, a recommend-as-you-go algorithm simulates how human minds work and relies on a sliding aggressiveness gauge to incrementally recommend context-aware UoWs. In this way, we hope to move one step further toward automatic service composition. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of the UoW-oriented service recommendation approach.




1 Introduction

As service-oriented software engineering (SOSE) becomes mainstream, more and more software components have been published onto the Internet as reusable services (or APIs, as lightweight services). People can leverage and compose existing services as components to build new functionality (a so-called mashup or workflow) faster than before [8]. Thus, it is becoming increasingly important to build service discovery and recommendation techniques that help people find and compose suitable services from a sea of available candidates [12].

In the science domain, our previous work [19] reveals that one major obstacle stopping scientists from reusing services (i.e., algorithms) developed by peer scientists is how to transform data types to feed the (usually comprehensive) input of the algorithms. In Earth science, for example, a data analytics service may require more than a dozen parameters [19]. Unless one knows exactly what such parameters mean and how to feed in data accordingly, she will be reluctant to reuse the service. When multiple external services are used in a scientific workflow, how to chain them remains a significant challenge. As indicated by our earlier work, the past linkage among services may be useful in service recommendation [20] and will help automatic service composition [7].

Take Fig. 1 as a highly simplified scenario. Assume that existing semantics-based methods (e.g., [10]) have recommended three services ($s_{1}$, $s_{2}$, and $s_{3}$) to be included in a workflow request. How to link them together, however, remains unresolved. As shown in Fig. 1, in a past workflow $wf_{1}$, $s_{1}$ and $s_{2}$ were used together with an indirect path linking them: $s_{1} \longmapsto s_{2}$. This means that the output of service $s_{1}$ is transformed through some intermediate steps and becomes the input of service $s_{2}$. Similarly, in another past workflow $wf_{2}$, services $s_{2}$ and $s_{3}$ were indirectly linked together: $s_{2} \longmapsto s_{3}$. Such provenance means that one can reuse the past linkages to chain $s_{1}$ with $s_{2}$, as well as $s_{2}$ with $s_{3}$. Note that although the three services were never used together in the past, by integrating the two chains $s_{1} \longmapsto s_{2}$ and $s_{2} \longmapsto s_{3}$, we harvest an unprecedented service chain $s_{1} \longmapsto s_{2} \longmapsto s_{3}$. This example illustrates that by mining service usage history, the three services can be automatically chained together without worrying about how to transform data among them.

The example shown in Fig. 1 directly motivates this research, which aims to systematically study how to mine service usage history and identify reusable, and possibly unprecedented, service chain snippets to facilitate automatic service composition. In concurrent database operations, a transaction is considered a unit of work (UoW), meaning that the operations encompassed in a UoW either all complete and become persistent, or are all rolled back as if nothing had happened. In this project, we borrow the term UoW to represent a collection of services chained together, possibly through some software components as glue, to fulfill some business goals. In other words, this research aims to mine workflow provenance and extract service units of work (UoWs) to facilitate new workflow design and development.

The contributions of this work are threefold. First, we coined the concept of a service unit of work and have developed a service social network to support the recording and retrieval of service UoWs. Second, we have developed novel algorithms to mine reusable service UoWs as a recommend-as-you-go service serving context-aware workflow queries. Third, our experiments over real-life datasets have demonstrated the effectiveness and efficiency of our approach.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces the UoW-driven workflow recommendation framework. Section 4 describes our service UoW mining and recommendation algorithms. Section 5 discusses and analyzes our experiments. Section 6 draws a conclusion.

2 Related Work

Service composition remains a fundamental topic in the field of service-oriented computing [12]. Paik et al. [11] divide service composition into four phases: planning, discovery, selection, and execution. Upon receiving a user's functional query, the four phases sequentially generate abstract workflow tasks, map abstract tasks to service instances, select a combination of service instances based on constraints, and run the service instances. According to their categorization, our work mainly falls into the third phase, service selection, where we aim to decide on concrete service instances for abstract workflow tasks. However, a typical service composition process carries two fundamental assumptions. The first is that a planning phase has resulted in a structured workflow plan, i.e., a reference model [4]. The second is that identified service components will be linked to each other directly [2], meaning that the output of a service component has to be plugged into another service component as input. In contrast, our research raises the bar and aims to fill the gap by automatically chaining service components based on their past collaboration history. Therefore, the starting point of our work is a collection of identified abstract tasks whose relationships remain to be decided by our work. As a result, our work does not require a typical full service planning phase to generate a structure of interconnected abstract workflow tasks.

A typical service selection phase focuses on finding a combination of service instances that maps to the identified collection of abstract tasks, based on certain constraints, typically QoS requirements or semantic compliance. Researchers have applied various types of optimization methods to tackle the multiple QoS-constrained service combination problem, including evolutionary algorithms [15] and Integer Programming [17]. In contrast, our research is complementary to QoS-based service selection, in the sense that we focus on automatically finding suitable connections among service components. More recently, Rodríguez-Mier et al. [12] optimize the global QoS of a composition based on a service match graph containing all possible semantically matched edges between service candidates. Unlike their work, where service connections are possible links based on semantic input/output descriptions, service connections in our work come from successful service connections that occurred in past workflows. For scientific researchers, such service connections are more trustworthy.

Semantic web technologies introduce rich and machine-understandable representations of service functions and properties, which facilitate automatic service composition [10]. For example, Li et al. [9] analyze user requirements and service descriptions and recommend services based on their semantic compatibility. In recent years, researchers have further leveraged natural language processing and machine learning to enhance service discovery and composition. For example, Xia et al. [16] develop a category-aware service clustering and recommendation method for automatic workflow composition. In this work, we leverage their techniques [9, 16] to measure the distance toward the goal during service composition, in order to dynamically compute the greediness of our algorithm.

IBM’s MatchUp project [6] recommends workflow fragments (mashlets with glue patterns) based on user context. Similarly, Roy Chowdhury et al. [3] recommend composition patterns (data components and connectors) based on a partial mashup. Our work differs from theirs in three significant ways. First, their work only considers service linkage patterns (glue patterns or connectors) in individual workflows. In contrast, our work considers service linkage patterns contributed by multiple workflows together, since we aggregate all past workflows into a service social network. Second, whereas their work aims to link identified service instances, our work aims to link identified conceptual services and results in (partially) linked service instances. Third, while their work only recommends linkage patterns between services and programmers still have to develop the actual glue code, our method automatically reuses previous linkage code to chain services together.

In addition to automatically linking service components, our work also, as other workflow recommender systems do, recommends related services based on the partial workflow placed by users. Based on a partial workflow, VisComplete [7] examines its paths and recommends path extensions based on a repository of workflows represented as graphs. Similarly, Zhang et al. [18] and Deng et al. [4] extract workflow patterns (nodes and their upstream subgraphs) from past workflow provenance. Such patterns are used as templates that are matched against a partially finished workflow to recommend extending nodes. In addition, IBM’s MashupAdvisor [5] recommends related services based on conditional co-occurrence probability and semantic matching. In contrast to this body of work, which focuses on finding what may co-occur based on structural comparison, our method focuses on finding services that fulfill the user intent.

The Business Process Management (BPM) community has contributed a number of approaches that recommend patterns as reusable components based on label/text similarity (such as [14]) or control flow patterns (e.g., [1]). In contrast, our project focuses on finding glue to link service candidates together based on past workflows that carry the usage history of the services. Our aim is to eliminate the human effort of linking services together. Furthermore, by constructing a service social network, our work may extract unprecedented service usage paths that may be useful for the design of scientific experiments (i.e., workflows).

3 UoW-Driven Workflow Recommendation Framework

In this research, we propose a novel technique to facilitate workflow design and development. At each step during a workflow design, we recommend a context-aware runnable group (i.e., Unit of Work) comprising multiple services, based on a service social network constructed from service usage history. Meanwhile, a sliding-window-like gauge is developed to predict the aggressiveness of a user on selecting UoWs during the workflow design process.

Figure 2 illustrates the overall blueprint of our UoW-driven workflow recommendation framework using a highly simplified scenario. In the top left corner, a UoW network is constructed offline based on existing workflows and services. Meanwhile, the descriptions of the workflows and services are analyzed and their intents are stored in databases. At time 1, upon receipt of a user query with an overall goal and identified conceptual services ($c_{1}$–$c_{4}$), the Recommendation Engine decides the aggressiveness of the user (i.e., 4). After consulting the UoW network and the workflow and service intent databases, a connected UoW with three service instances ($s_{1}$–$s_{3}$) is recommended, covering three conceptual services ($c_{1}$–$c_{3}$). At time 2, after the user accepts the recommendation, her aggressiveness drops to 1. Based on the partial workflow (i.e., UoW) at hand, the Recommendation Engine recommends service instance $s_{4}$. Since $s_{4}$ is not connected to the UoW, the user has to build the connection herself. This scenario shows that our technique not only helps to find concrete service instances, but also significantly reduces the user's effort in linking the service instances together. Note that a finished workflow at time 3 is added to the UoW network and the workflow/service intent databases on the fly, as shown in Fig. 2. In the next sections, we explain the technique in detail.

3.1 UoW Network Construction

In the first step, we describe how to construct a UoW network.

Definition 1

(Unit of Work - UoW). A unit of work is a connected directed graph $uow={<}S',E'{>}$ extracted from a directed graph of a workflow $w={<}S,E{>}: uow \subseteq w$, iff:

1. $S' \subseteq S$
2. $E' \subseteq E$
3. $\forall s_{i},s_{j}\in S', \exists $ path $s_{i} \longmapsto s_{j}$ and/or $\exists $ path $s_{j} \longmapsto s_{i}$

where S is the set of services (vertices) used by the workflow, and E is the set of edges linking between the services labeled with workflow identifier.
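The three conditions of Definition 1 can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation; the function names `reachable` and `is_uow` are hypothetical, edges are plain `(source, target)` pairs, and the workflow-identifier labels on edges are ignored for brevity.

```python
from itertools import combinations

def reachable(start, edges):
    """Return the set of vertices reachable from start along directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_uow(sub_services, sub_edges, services, edges):
    """Check the three conditions of Definition 1 for a candidate subgraph."""
    sub_services, sub_edges = set(sub_services), set(sub_edges)
    if not sub_services <= set(services):   # condition 1: S' is a subset of S
        return False
    if not sub_edges <= set(edges):         # condition 2: E' is a subset of E
        return False
    # Assumption: a well-formed candidate only has edges between its own vertices.
    if any(a not in sub_services or b not in sub_services for a, b in sub_edges):
        return False
    # Condition 3: every pair is joined by a path in at least one direction.
    reach = {s: reachable(s, sub_edges) for s in sub_services}
    return all(b in reach[a] or a in reach[b]
               for a, b in combinations(sub_services, 2))
```

For a workflow with edges `s1 -> s2 -> s3`, the single service `{s1}` and the whole workflow both qualify, while `{s1, s3}` without any edge does not.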

A uow might consist of a single service with no edge or it might be the entire workflow. The number of uows depends on the edges existing between various services. If we consider the set of all UoWs extracted from a workflow $w_{i}$ with k edges among services to be $\{uow_{1,i}, uow_{2,i}, \ldots , uow_{m,i}\}$, then the size of this set is $m\le 2^{k}$. The set of all units of work extracted from a collection of workflows can form a network of units of work as defined below.

Definition 2

(Network of Units of Work). A network of units of work over M workflows is defined as $N_{uow} = {<}S'', E''{>}$ where $S'' = \bigcup _{j=1}^{M} S_{j}$ is the set of all services included in all workflows, and $E'' = \bigcup _{j=1}^{M} E_{j}$ is the set of all the edges in the network, each labeled with a workflow identifier.

The total number of uows that can be extracted from the whole network is $\ge |S''|$ and $\le 2^{|E''|}$. A UoW network is constructed by creating a directed graph whose vertices are the set of all services contained in all workflows and in which each labeled edge represents a path between a pair of services in a workflow.
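As an illustration of Definition 2, the sketch below builds the network as a union of per-workflow vertex and edge sets and then searches it for a chain. It is written in Python for brevity and assumes a simple dictionary input format; `build_uow_network` and `find_chain` are hypothetical names, and breadth-first search stands in for whatever path search the actual system uses.

```python
from collections import deque

def build_uow_network(workflows):
    """workflows: dict workflow_id -> (services, edges).
    Returns (S'', E'') with each edge stored as (source, target, workflow_id)."""
    all_services, labeled_edges = set(), set()
    for wf_id, (services, edges) in workflows.items():
        all_services |= set(services)
        for src, dst in edges:
            labeled_edges.add((src, dst, wf_id))
    return all_services, labeled_edges

def find_chain(labeled_edges, start, goal):
    """Breadth-first search for a chain start -> goal.
    Returns the list of labeled edges on the chain, or None if none exists."""
    adjacency = {}
    for src, dst, wf_id in sorted(labeled_edges):
        adjacency.setdefault(src, []).append((dst, wf_id))
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for dst, wf_id in adjacency.get(node, []):
            if dst not in parent:
                parent[dst] = (node, wf_id)
                queue.append(dst)
    if goal not in parent:
        return None
    chain, node = [], goal
    while parent[node] is not None:
        prev, wf_id = parent[node]
        chain.append((prev, node, wf_id))
        node = prev
    return chain[::-1]
```

Applied to the scenario of Fig. 1, with `wf1` contributing `s1 -> s2` and `wf2` contributing `s2 -> s3`, the search returns a chain from `s1` to `s3` whose two edges carry different workflow labels, which is the unprecedented chain discussed in the introduction.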

3.2 Search Query Analysis

Based on our previous work [16] on service discovery driven by topic modeling, assume that the network $N_{uow}$ implies a collection of latent topics that can be automatically extracted from all of its workflow descriptions. Such latent topics form an Intent space T. Each service or workflow serves some functionality, meaning that each of them can be represented by a distribution of topics over T. Meanwhile, we use the term context to represent the scenarios that a service can serve, also in the form of a distribution of topics over T, based on its past contributions to workflows in the network.

Definition 3

(Service Intent and Context). A service $s \in S''$ is defined as a tuple $s= {<}\phi _{s},\psi _{s}{>}$. $\phi _{s}$ denotes the intent of service s as a distribution of topics over the Intent space of the network $N_{uow}$, where the entries of its |T|-dimensional vector of probabilities sum to 1: $\sum _{i=1}^{|T|}p_{i,s}=1$. $\psi _{s}$ denotes the context of service s as a union of the intents of the workflows in the network $N_{uow}$ in which service s is a component: $\psi _{s}=\bigcup _{j=1}^{M}{\phi _{w_{j}}}$, taken over those of the M workflows that contain s. $\phi _{w}$ denotes the intent of workflow w as a distribution of topics over $N_{uow}$, where the entries of the |T|-dimensional vector of probabilities sum to 1: $\sum _{k=1}^{|T|}p_{k,w}=1$.

Note that through vector arithmetic, the context of a service also represents a distribution of topics over the Intent space of the network $N_{uow}$. Consequently, the intent of a uow can be viewed as a union of the intent of services included.

Definition 4

(Intent of Unit of Work). Intent of a unit of work u is defined as $\phi _{u} = \{ \phi _{1,u}, \phi _{2,u}, \ldots , \phi _{|T|,u}\}$, where the intent value can be calculated using a SoftMax function $\sigma $ such that:

$$\begin{aligned} \phi _{j,u} = \sigma (\sum _{s \in u} \phi _{j,s}) = \frac{e^{\sum _{s \in u} \phi _{j,s}}}{\sum _{k=1}^{|T|}e^{\sum _{s \in u} \phi _{k,s}}} \end{aligned}$$

(1)

where $\phi _{j,s}$ denotes the $j^{th}$ intent value of service $s \in u$ and $\phi _{j,u}$ denotes the $j^{th}$ intent value of uow u.
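Equation (1) sums the services' intent vectors topic by topic and then normalizes with a softmax. A minimal Python sketch follows; `uow_intent` is a hypothetical name, and subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math

def uow_intent(service_intents):
    """service_intents: list of equal-length topic vectors, one per service in the UoW.
    Returns the UoW intent vector of Eq. (1)."""
    n_topics = len(service_intents[0])
    # Per-topic sum over all services in the UoW.
    summed = [sum(vec[j] for vec in service_intents) for j in range(n_topics)]
    peak = max(summed)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(x - peak) for x in summed]
    total = sum(exps)
    return [e / total for e in exps]
```

For two services with intents `[0.7, 0.2, 0.1]` and `[0.5, 0.4, 0.1]`, the per-topic sums are `[1.2, 0.6, 0.2]`, so the resulting distribution keeps the same topic ordering and sums to 1.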

Before a user starts to design a workflow, a user query would be provided containing information about the user’s goal of her desired workflow, which means intents can also be extracted from the query to make it machine understandable.

Definition 5

(Intent of user query). The intent of a user query q can be represented by $\phi _{q}$, a distribution of topics in query q over the Intent space of the network $N_{uow}$.

With the introduction of the intent and context of service/workflow/query, a search query at a given time point can be defined as follows:

Definition 6

(Search Query). A search query at time t is a triple $q_{t} = {<}G_{q}, W_{t}, A_{t}{>}$, where $G_{q} = {<}\phi _{q}, Pr_{q}{>}$ is the final goal of the user which contains $\phi _{q}$ as the user’s intent and $Pr_{q}$ as the list of user’s desired conceptual services identified, respectively. $W_{t}$ represents the current partial workflow, and $A_{t}$ is the user’s current aggressiveness which will be defined shortly.

In the beginning, users may intuitively prefer a UoW recommendation that covers as many of the intended conceptual services as possible, even though some irrelevant services (i.e., noise) are included. During a workflow design process, however, the aggressiveness of a user in choosing larger-sized units of work reduces as the workflow-in-progress grows towards the goal. Thus, a user’s aggressiveness can be defined as follows:

Definition 7

(User’s Aggressiveness). The aggressiveness of a user with query q at time t is a function of the user’s desired conceptual services $G_{q}.Pr_{q}$ and what she has achieved so far with her current workflow $W_{t}$:

$$\begin{aligned} \begin{aligned} A_t&= \alpha *{(1-coverage(W_{t},G_{q}.Pr_{q}))}+\beta *{similarity(\phi _{W_{t}},\phi _{q})}\\&\quad -\gamma *\frac{noise(W_{t},G_{q}.Pr_{q})}{|W_{t}|} \end{aligned} \end{aligned}$$

(2)

where coverage is a function that computes the overlap rate of the current partial workflow over the expected goals, similarity is a function that measures the semantic (i.e., intent) similarity between the partial workflow and the user query, and noise is a function that computes unwanted services introduced through the UoWs, normalized. $\alpha $, $\beta $, and $\gamma $ are coefficients such that $\alpha +\beta +\gamma =1$.
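Equation (2) can be sketched in Python as below. The text does not spell out the three component functions, so the concrete forms here are assumptions made only for illustration: coverage is the fraction of desired conceptual services already fulfilled, similarity is cosine similarity between intent vectors, and noise counts services in $W_{t}$ that fulfill no desired conceptual service. The coefficient values are also placeholders that merely satisfy $\alpha +\beta +\gamma =1$.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def aggressiveness(workflow, desired, wf_intent, query_intent,
                   alpha=0.5, beta=0.3, gamma=0.2):
    """workflow: dict service -> conceptual service it fulfils (None marks noise).
    desired: set of conceptual services Pr_q. Implements Eq. (2) under the
    assumed definitions of coverage, similarity, and noise described above."""
    covered = {c for c in workflow.values() if c in desired}
    coverage = len(covered) / len(desired) if desired else 1.0
    noise = sum(1 for c in workflow.values() if c not in desired)
    noise_rate = noise / len(workflow) if workflow else 0.0  # noise / |W_t|
    return (alpha * (1 - coverage)
            + beta * cosine(wf_intent, query_intent)
            - gamma * noise_rate)
```

With an empty partial workflow the coverage term dominates, so under these placeholder coefficients the value starts at `alpha`.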

3.3 Basic UoW Extraction and Recommendation

After the UoW network is built, Algorithm 1 shows how to extract all candidate UoWs from the network based on a user search query $q_{t}$. This algorithm first finds the desired conceptual services from the user’s goals that have not been satisfied yet by the user’s current workflow (line 3). Afterwards, for each of these conceptual services, we find all the services in the network that have similar intents, using a similarity threshold of $\lambda $. In this way, we obtain candidate services that are clustered based on the conceptual services to which they are similar (line 4). In line 5, we fetch all the combinations of these services that satisfy the user’s remaining desired conceptual services.

For finding the candidate UoWs, we try to find UoWs from the network with connection to the user’s current workflow (lines 6–15). We first consider service $s_{i}$ from the current workflow to be our connection point to the candidate UoWs. Then, we remove all other services in the current workflow from the network and get all the weakly connected subgraphs from the remaining network. Next, we try to focus only on the subgraph which contains our connection point. Finally, for every combination of candidate services from different clusters, we will add all the connected subgraphs attached to the connection point to the UoW recommendation list.

Some candidate UoWs might not have any path in history to any of the services in the user’s current workflow. However, they might actually help the user towards her final goal and the user can decide how to attach them to her current workflow. Thus, for finding such UoWs (lines 16–24), we first remove all the current services from the network and then get all the weakly connected subgraphs in this remaining network. Then in the next step, for every combination of service candidates and every subgraph $subgraph_{j}$, we first find the services $S'$ from this combination that exist in the selected subgraph. Finally, we add all the candidate UoWs containing $S'$ in the $subgraph_{j}$ to the UoW candidate list.
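Both searches described above rely on the same graph step: remove services of the current workflow from the network, then split what remains into weakly connected subgraphs. The Python sketch below illustrates only that step, not Algorithm 1 as a whole; the function names and the optional `keep` argument (the connection point that stays in the network for the connected case) are assumptions for illustration.

```python
def weakly_connected_components(vertices, edges):
    """Components of the graph when edge direction is ignored."""
    neighbours = {v: set() for v in vertices}
    for src, dst in edges:
        if src in neighbours and dst in neighbours:  # skip edges touching removed vertices
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    seen, components = set(), []
    for v in sorted(vertices):
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

def components_without_current(vertices, edges, current_services, keep=None):
    """Drop current-workflow services (optionally keeping one connection point)
    and return the weakly connected components of the remaining network."""
    removed = set(current_services) - ({keep} if keep is not None else set())
    remaining = set(vertices) - removed
    return weakly_connected_components(remaining, edges)
```

In the connected case one would then keep only the component containing the connection point; in the disconnected case every component is examined against each combination of candidate services.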

In the very last line of Algorithm 1 (line 25), we add all the candidate services to the recommendation list as individual UoWs. This way, if there is no path between them in the history network, we can still recommend single services to the user.

The function addUoWs used on lines 13 and 22 of Algorithm 1 is presented in Algorithm 2. This algorithm is used for recursively finding candidate UoWs in a specific component, containing a number of desired services. The base case in this algorithm (lines 2–6) is when the $uow_{hypothesis}$ is a weakly connected graph and this is when we can add this UoW to the recommendation list. Here, if we want the resulting UoWs to be connected to a connection point (which is, as described before, a service from the user’s current workflow), then it should also be among the services in the candidate hypothesis so that we can add it to the candidate list (lines 3–5). However, if the hypothesis is still not connected (lines 7–17), for every pair of services, we find the shortest path between them. If found, we add the path to the hypothesis and try to find the candidate UoWs with the remaining services and the extended $uow_{hypothesis}$.
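The idea of growing a hypothesis by adding shortest paths until it becomes weakly connected can be illustrated with a simplified, greedy variant. The Python sketch below is not Algorithm 2: it returns a single connected hypothesis instead of recursively enumerating all candidates, it ignores the connection-point check, and it uses unweighted breadth-first search in place of Dijkstra. All function names are hypothetical.

```python
from collections import deque
from itertools import permutations

def shortest_path(edges, start, goal):
    """Unweighted shortest directed path as a list of vertices, or None."""
    adjacency = {}
    for src, dst in sorted(edges):
        adjacency.setdefault(src, []).append(dst)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def weak_components(vertices, edges):
    """Map each vertex to a representative of its weakly connected component."""
    neighbours = {v: set() for v in vertices}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    label = {}
    for v in sorted(vertices):
        if v in label:
            continue
        label[v] = v
        stack = [v]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in label:
                    label[nxt] = v
                    stack.append(nxt)
    return label

def connect_services(edges, desired):
    """Grow a hypothesis from the desired services until it is weakly connected.
    Returns (vertices, edges) of the hypothesis, or None if history has no path."""
    hyp_vertices, hyp_edges = set(desired), set()
    while True:
        label = weak_components(hyp_vertices, hyp_edges)
        if len(set(label.values())) <= 1:
            return hyp_vertices, hyp_edges
        best = None
        for a, b in permutations(sorted(desired), 2):
            if label[a] == label[b]:
                continue  # already joined in the hypothesis
            path = shortest_path(edges, a, b)
            if path is not None and (best is None or len(path) < len(best)):
                best = path
        if best is None:
            return None
        hyp_vertices |= set(best)
        hyp_edges |= set(zip(best, best[1:]))
```

Each round joins at least two previously separate parts of the hypothesis, so the loop terminates; intermediate vertices on an added path (the glue components) become part of the resulting UoW.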

4 UoW Recommendation As-You-Go

With the preliminaries defined, the UoW recommendation problem can be formalized as searching the network of UoWs for the largest unit of work that is closest to the goal of the user query and contains the minimum number of irrelevant services. Fulfilling a comprehensive user query may take multiple steps over a period of time. At any given time, the recommendation should take into consideration the partial workflow placed by the user.

Definition 8

(UoW Recommendation Problem). Given a search query q, the UoW recommendation problem at time t is defined as finding a unit of work (uow) in the network of UoWs ($N_{uow}$), such that under aggressiveness $A_{t}$:

1. $max(coverage(uow,G_{q}.Pr_{q}))$
2. $max(similarity(\phi _{uow},\phi _{q}))$
3. $min(noise(uow,G_{q}.Pr_{q}))$

Based on the initial UoWs found in the last section, we develop an approach which can answer the UoW recommendation problem defined above. Algorithm 3 searches the network of UoWs along with the user’s query and recommends a set of UoWs.

In this algorithm, we first fetch a set of candidate UoWs by using Algorithm 1 and then for each of these candidates, we make some examinations. We check if the context is similar enough to the query’s context, and if the candidate has enough useful services with respect to the user’s aggressiveness and the noise in the whole candidate (lines 4–8). If the candidate UoW meets the criteria, it is added to the candidate list. It should be noted that we consider threshold $\delta $ for the context similarity. For checking the usefulness, we use the equation below:

$$\begin{aligned} \frac{N_{usefulServices_{i}}}{N_{V'_{i}}-N_{usefulServices_{i}}} \le A_{t} \end{aligned}$$

(3)

Based on the above equation, we want the ratio of the number of useful services in $uow_{i}$ to the number of noise services (irrelevant services in $uow_{i}$) to be no more than the aggressiveness of the user at the time. In case the number of noise services is equal to zero, we only consider the number of useful services.

Finally, in line 4, we sort the candidate list by $\frac{N_{usefulServices_{i}}}{N_{V'_{i}}-N_{usefulServices_{i}}}$ in descending order and return the list to the user.
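The filtering and ranking step built on Eq. (3) can be sketched in Python as follows. This is an illustration under assumptions rather than Algorithm 3 itself: each candidate is summarized as a hypothetical tuple `(name, n_useful, n_total, similarity)`, the context similarity is taken as precomputed, and the threshold test is assumed to be `similarity >= delta`.

```python
def usefulness_ratio(n_useful, n_total):
    """Left-hand side of Eq. (3); with zero noise, only the useful count is used."""
    noise = n_total - n_useful
    return n_useful if noise == 0 else n_useful / noise

def rank_candidates(candidates, aggressiveness, delta=0.01):
    """candidates: list of (name, n_useful, n_total, similarity) tuples.
    Keeps candidates passing the similarity threshold and Eq. (3),
    then sorts them by the usefulness ratio in descending order."""
    kept = [
        (name, usefulness_ratio(useful, total))
        for name, useful, total, similarity in candidates
        if similarity >= delta and usefulness_ratio(useful, total) <= aggressiveness
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

With an aggressiveness of 4, a candidate with five useful services and no noise is filtered out, while candidates with ratios 3 and 2 are kept and returned in that order.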

5 Experiments and Analysis

We have designed and conducted a collection of experiments to evaluate the effectiveness and efficiency of our method over a real-world dataset.

5.1 Experimental Setup

We chose myExperiment.org as our testbed since it is the world’s largest service-oriented scientific workflow repository. Its metadata was used to analyze workflow intent, and its source code to analyze UoWs. Since the majority of workflows were developed using the Taverna tool, we crawled all the publicly available Taverna workflows up to May 2018. This resulted in 3,277 workflows, counting different versions, from 2,030 unique workflows for our experiments. For each workflow, we scrutinized all the services contained and extracted 513 unique services of various types, including WSDL, SoapLab, BioMoby, and SADI services along with REST calls. From the initial list of workflows, we obtained 1,719 workflows that invoke at least one of those services. In total, 1,248 unique operations of these services are used across all the workflows. We thus create our UoW network as summarized in Table 1.

Table 1. Summary of experiments


Over the dataset, all workflows were sorted by their creation dates. We designed two types of experiments to study how our technique would have worked historically, starting from the oldest workflows and moving toward the most recent ones. The difference between them is the way the UoW network is created. In the first type of experiment, for each workflow under study, the UoW network contains all prior workflows. In the second type of experiment, for every workflow, the UoW network remains almost the same, including all workflows in the testbed except the one under study and its subsequent versions. The rationale is that the UoW network would become very rich after many workflows have joined.

For each type of experiment, chronologically, each workflow served as a search query, including its description and conceptual components (i.e., the processor names defined in all Taverna workflows). For each query, we ran our recommendation algorithm, used the top recommended UoW and continued the recommend-as-you-go process, until either the goal of the workflow was reached or the algorithm could not find any more UoWs.

Our baseline methods are the recommendation methods based on keywords, semantics (i.e., the method in [12]) and patterns (i.e., the methods in [4, 18]). All baseline approaches were applied to our two types of experiments for comparison.

We used $\lambda = 0.2$ and $\delta = 0.01$ for the experiments. The Dijkstra algorithm [13] was used for finding the shortest paths in the UoW network. The experiments were run on a Windows 10 64-bit machine with 64 GB memory and Intel i7-7700 CPU @ 3.60 GHz*8. The code was implemented in Java 8, and the JGraphT library was used for building the graphs and the algorithms on top of them.

5.2 Evaluation Metrics

In order to evaluate our method, we adopted four evaluation metrics. Note that the measurements are only applied to the final step of our method, because our method operates a multi-step procedure to identify UoWs and services, controlled by dynamically changing aggressiveness.

1. Average Precision: The average number of correctly predicted services at the final step of the recommend-as-you-go process, for the top recommended UoWs over the total number of services predicted $\frac{N_{\text{correctly-predicted}}}{N_{\text{total-recommended-services}}}$.
2. Average Recall: The average number of correctly predicted services at the final step of the recommend-as-you-go process, for the top recommended UoWs over the total number of services in the test workflow $\frac{N_{\text{correctly-predicted}}}{N_{\text{total-WF-services}}}$.
3. Accumulated Saved Steps: The accumulated number of steps saved for reaching the last result, by using our approach.
4. Accumulated Saved Links: The accumulated number of decisions about the connections among services saved for the user by our approach.
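The first two metrics can be computed per workflow and then averaged. The Python sketch below assumes that recommended and actual services are available as sets of identifiers; the function names are hypothetical, and an empty set yields a score of 0.0 by convention.

```python
def precision_recall(recommended, actual):
    """Precision and recall of one final recommendation against the test workflow."""
    recommended, actual = set(recommended), set(actual)
    correct = len(recommended & actual)
    precision = correct / len(recommended) if recommended else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

def average_precision_recall(pairs):
    """pairs: list of (recommended, actual) set pairs, one per test workflow."""
    scores = [precision_recall(r, a) for r, a in pairs]
    n = len(scores)
    return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)
```

For example, recommending `{s1, s2, x}` for a workflow that actually uses `{s1, s2, s3, s4}` gives a precision of 2/3 and a recall of 1/2.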

5.3 Experimental Results and Analysis

Recommendation Effectiveness: Since the starting point of the experiments is just intents (i.e., descriptions) and a collection of conceptual components (i.e., service candidates), without a structural design (i.e., a reference model), the baseline methods based on patterns [4, 18] do not work in the beginning.

We then studied how the pattern-based baseline methods could work after our method identified the first round of UoWs, that is, once a partial structure exists. For type 1 experiment, out of the 1,719 workflows in the testbed, 191 workflow designs require multiple steps in our method. For those workflows, the baseline methods recommended services for 7 workflows. For type 2 experiment, the baseline methods recommended 16 services out of 291 workflows that require multiple steps in our method. The reason is that most of the services contained in a workflow were not used together in earlier workflows, so their co-occurrence was not recorded in the pattern table/repository of the baseline methods. This situation is common in scientific research, because scientific workflows typically imply unprecedented experiment design.

We also studied whether the semantics-based baseline approach [12] could work on the testbed. It does not, because scientific services (i.e., algorithms) typically cannot be directly chained together without data transformation [19]. Thus, semantic input/output match-making among services did not work well for the testbed.

In contrast, our approach successfully recommended service components for all 1,719 workflows. Therefore, we will only discuss in detail the recommendation quality of our approach under the two types of experiments.

Recommendation Quality: Figure 3 compares the precision and recall of our algorithm in the two types of experiments aforementioned. For each diagram, the x axis represents the timeline, and the y axis represents the average precision/recall for each day if our recommendation algorithm is applied on the workflows published on the date. Figure 3(a) and (b) compare the recommendation precision and recall for all workflows in the testbed. For each figure, Test 1 and Test 2 curves represent the precision/recall of our algorithm applied on type 1 scenario (where $N_{uow}$ only comprises all workflows published prior to the date) and type 2 scenario (where $N_{uow}$ contains all workflows except the one under examination and its subsequent versions), respectively.

As can be seen, the type 2 experiment, with its richer UoW network, yields more effective results, especially in the early period. In the type 1 experiment, the UoW network available for the early workflows contains too few services and links to support recommendation. However, as time goes by, both precision and recall improve as the UoW network becomes richer.

From the 1,719 workflows, we also removed those containing only one service, in order to investigate the results for more complex scenarios. Figure 3(c) and (d) present the comparison results for the remaining 511 workflows.

Over the entire testbed, the average precision/recall of our proposed method is 44%/43% for type 2 experiments, compared to 29%/31% for type 1 experiments. For workflows containing at least two external services, the average precision/recall of our method is 52%/43% for type 2 experiments, compared to 34%/25% for type 1 experiments.
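The per-day averages reported here can be illustrated with a minimal sketch. It assumes the standard set-based definitions of precision and recall (recommended services versus the services actually used in a workflow); the function names and the sample service names are hypothetical and are not taken from the authors' implementation.

```python
def precision_recall(recommended, actual):
    """Set-based precision and recall for one workflow (assumed definitions)."""
    recommended, actual = set(recommended), set(actual)
    hits = len(recommended & actual)  # services both recommended and actually used
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


def daily_average(results_by_day):
    """Average the (precision, recall) pairs of all workflows published on each day."""
    averages = {}
    for day, pairs in results_by_day.items():
        n = len(pairs)
        averages[day] = (
            sum(p for p, _ in pairs) / n,
            sum(r for _, r in pairs) / n,
        )
    return averages


# Hypothetical example: four services recommended, three actually used, two in common.
p, r = precision_recall(["blast", "align", "plot", "fetch"], ["blast", "align", "merge"])
print(p, r)
```

Averaging per publication date, as in `daily_average`, mirrors the way the y axis of Fig. 3 is described; any weighting by workflow size would be a different design choice.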

Development Efforts Saved: Figure 4 shows the accumulated development effort that would have been saved by our approach, from the inception of myExperiment.org to May 2018. The baseline method is a keyword-based method that finds one service at a time. In each diagram, the x axis represents the timeline. Figure 4(a) examines the development steps saved. The y axis represents the development steps saved for each day, had our algorithm been applied to the workflows published on that date. For each workflow comprising N conceptual services, the baseline method requires N steps. In contrast, our approach recommends units of work, each of which may comprise multiple services, so development steps may be saved. Assume that our algorithm recommends K UoWs, each containing $|uow_{i}|$ services. We then have $\sum _{i=1}^{K} |uow_{i}| = N$. If each UoW is a single service, $K=N$; otherwise, $K<N$. For each $uow_{i}$, $|uow_{i}|-1$ steps are saved. Therefore, for the entire process with K UoWs, a total of $\sum _{i=1}^{K} (|uow_{i}|-1) = N-K$ steps are saved. For the testbed, we identified UoWs for 107 workflows under type 1 experiments and 183 workflows under type 2 experiments. As shown in Fig. 4(a), our algorithm would have saved an accumulated 303 or 166 steps compared to the baseline method in the two types of experiments, respectively.
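As a small worked illustration of the step-saving formula, the sketch below computes $N-K$ from a list of UoW sizes. The function name and the example sizes are hypothetical; only the formula itself comes from the text.

```python
def steps_saved(uow_sizes):
    """Development steps saved when K UoWs cover N conceptual services.

    uow_sizes[i] is |uow_i|, the number of services in the i-th recommended UoW.
    The baseline needs N steps (one per service); the UoW approach needs K.
    """
    n = sum(uow_sizes)   # N: total conceptual services
    k = len(uow_sizes)   # K: number of recommended UoWs
    return n - k         # equals sum(|uow_i| - 1)


# Hypothetical workflow covered by UoWs of sizes 3, 2 and 1: N = 6, K = 3.
print(steps_saved([3, 2, 1]))  # 3
```

When every UoW is a single service, the list contains only ones and the function returns 0, which matches the $K=N$ case.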

Figure 4(b) studies the service linkages saved by our approach. Unlike the baseline method, which recommends only services and leaves scientists to chain them manually, our approach recommends units of work that include the linkages among services. Consider again the example above, with a workflow of N conceptual services for which our algorithm recommends K UoWs. Using the baseline method, one has to make decisions on $C_{N}^{2}$ connections. Using our method, a total of $\sum _{i=1}^{K} C_{|uow_{i}|}^{2}$ link considerations could have been saved. As shown in Fig. 4(b), a rich UoW network significantly increases the linkage saving: an accumulated 460 linkages, compared to about 242 linkages. As discussed earlier, chaining data services may be a hard problem. As a result, our approach may save scientists significant time and let them focus on science.
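The linkage counts can be sketched in the same way, using the binomial coefficient $C_{n}^{2}$ from Python's standard library. The function names and the example sizes are hypothetical; the two formulas are the ones stated in the text.

```python
from math import comb


def baseline_link_decisions(n_services):
    """Pairwise connections to consider when chaining N services by hand: C(N, 2)."""
    return comb(n_services, 2)


def linkages_saved(uow_sizes):
    """Link considerations saved by K UoWs: the sum of C(|uow_i|, 2) over all UoWs."""
    return sum(comb(size, 2) for size in uow_sizes)


# Hypothetical workflow with N = 6 services covered by UoWs of sizes 3, 2 and 1.
print(baseline_link_decisions(6))  # 15
print(linkages_saved([3, 2, 1]))   # 3 + 1 + 0 = 4
```

A single-service UoW contributes nothing, since `comb(1, 2)` is 0; the saving grows quadratically with the size of each recommended UoW.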

Time Complexity: We also studied the performance of applying our recommendation algorithm online. Using the same example, with a workflow of N conceptual services and a UoW network of $S''$ services and $E''$ edges, the time complexity would become $O(S''^{N} \log (S''^{N}))$. First, finding candidate services will cost $O(S''N)$. Second, finding all their combinations for different clusters will cost $O(S''^{N})$. Third, finding all service instances in the network will cost $O(|S''| + |E''|)$. Fourth, finding the shortest path between services will cost $O(|S''|^{2})$. Fifth, sorting candidates will cost $O(S''^{N+1} \log (S''^{N+1}))$.
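The exponential term comes from the second step, combining one candidate per conceptual service. The sketch below is only an illustration of that growth, assuming each of the N conceptual services has its own list of candidate services; the function name and sample data are hypothetical, and this is not the authors' algorithm.

```python
from itertools import product


def candidate_combinations(candidates_per_service):
    """Enumerate every way to pick one candidate for each conceptual service.

    With up to S candidates for each of N conceptual services, the result
    holds up to S**N tuples, which is the source of the S''^N term.
    """
    return list(product(*candidates_per_service))


# Hypothetical request with N = 3 conceptual services and 2 candidates each.
combos = candidate_combinations([["a1", "a2"], ["b1", "b2"], ["c1", "c2"]])
print(len(combos))  # 2 ** 3 = 8
```

Because the number of tuples multiplies with every additional conceptual service, sorting them dominates the cost once N grows beyond a handful of services.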

6 Conclusions

In this paper, we have presented a novel technique to facilitate workflow design and development. At each step, we recommend a runnable group comprising multiple services, based on a service social network constructed from service usage history. The technique stands out for four reasons. First, recommending multiple services instead of a single service shortens the overall design phase. Second, the recommended service group is derived from past workflow usage history with an analogous context, which makes the recommendation more trustworthy. Third, previously unseen service collaboration patterns may be extracted through cross-workflow mining over the service social network. Fourth, the recommended services are already chained, which saves users the effort of interconnecting them, a task that can be both time-consuming and difficult.

We plan to continue our research in two directions. First, we will further study the scalability of our technique and tune its performance. Second, we will study how to integrate our approach with existing methods that focus on other phases of the service composition process, toward building an end-to-end service composition recommendation methodology and a tailored platform.

References

1. Van Der Aalst, W.M.P., Ter Hofstede, A.H.M., Kiepuszewski, B., Barros, A.P.: Workflow patterns. Distrib. Parallel Databases 14, 5–51 (2003)
2. Bevilacqua, L., Furno, A., di Carlo, V.S., Zimeo, E.: A tool for automatic generation of WS-BPEL compositions from OWL-S described services. In: Proceedings of IEEE International Conference on Software, Knowledge Information, Industrial Management and Applications, Benevento, Italy, pp. 1–8 (2011)
3. Roy Chowdhury, S., Daniel, F., Casati, F.: Efficient, interactive recommendation of mashup composition knowledge. In: Kappel, G., Maamar, Z., Motahari-Nezhad, H.R. (eds.) ICSOC 2011. LNCS, vol. 7084, pp. 374–388. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25535-9_25
4. Deng, S., et al.: A recommendation system to facilitate business process modeling. IEEE Trans. Cybern. 47(6), 1380–1394 (2017)
5. Elmeleegy, H., Ivan, A., Akkiraju, R., Goodwin, R.: MashupAdvisor: a recommendation tool for mashup development. In: Proceedings of IEEE International Conference on Web Services, pp. 337–344. IEEE, Beijing (2008)
6. Greenshpan, O., Milo, T., Polyzotis, N.: Autocompletion for mashups. Proc. VLDB Endow. 2(1), 538–549 (2009)
7. Koop, D., Scheidegger, C.E., Callahan, S.P., Freire, J., Silva, C.T.: VisComplete: automating suggestions for visualization pipelines. IEEE Trans. Vis. Comput. Graph. 14, 1691–1698 (2008)
8. Lemos, A.L., Daniel, F., Benatallah, B.: Web service composition: a survey of techniques and tools. ACM Comput. Surv. 48(3), Article no. 33 (2016)
9. Li, C., Zhang, R., Huai, J., Guo, X., Sun, H.: A probabilistic approach for web service discovery. In: Proceedings of IEEE International Conference of Services Computing, pp. 49–56. IEEE, Santa Clara (2013)
10. McIlraith, S.I., Son, T.C., Zeng, H.: Semantic web services. IEEE Intell. Syst. 16(2), 46–53 (2001)
11. Paik, I., Chen, W., Huhns, M.N.: A scalable architecture for automatic service composition. IEEE Trans. Serv. Comput. 7(1), 82–95 (2014)
12. Rodríguez-Mier, P., Mucientes, M., Lama, M.: Hybrid optimization algorithm for large-scale QoS-aware service composition. IEEE Trans. Serv. Comput. 10(4), 547–559 (2017)
13. Skiena, S.: Dijkstra’s Algorithm. Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica, pp. 225–227. Addison-Wesley, Reading (1990)
14. Smirnov, S., Weidlich, M., Mendling, J., Weske, M.: Action patterns in business process models. In: Baresi, L., Chi, C.-H., Suzuki, J. (eds.) ICSOC/ServiceWave-2009. LNCS, vol. 5900, pp. 115–129. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10383-4_8
15. Wada, H., Suzuki, J., Yamano, Y., Oba, K.: E3: a multiobjective optimization framework for SLA-aware service composition. IEEE Trans. Serv. Comput. 5(3), 358–372 (2012)
16. Xia, B., Fan, F., Tan, W., Huang, K., Zhang, J., Wu, C.: Category-aware API clustering and distributed recommendation for automatic mashup creation. IEEE Trans. Serv. Comput. 8(5), 674–687 (2015)
17. Zeng, L., Benatallah, B., Ngu, A.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-aware middleware for web services composition. IEEE Trans. Softw. Eng. 30(5), 311–327 (2004)
18. Zhang, J., Liu, Q., Xu, K.: FlowRecommender: a workflow recommendation technique for process provenance. In: Proceedings of the Australasian Data Mining Conference, pp. 55–61. Australian Computer Society, Inc. (2009)
19. Zhang, J., et al.: Climate analytics workflow recommendation as a service - provenance-driven automatic workflow mashup. In: Proceedings of IEEE International Conference on Web Services, pp. 89–97. IEEE, New York (2015)
20. Zhang, J., Tan, W., Alexander, J., Foster, I., Madduri, R.: Recommend-as-you-go: a novel approach supporting services-oriented scientific workflow reuse. In: Proceedings of IEEE International Conference on Services Computing, pp. 48–55. IEEE, Washington DC (2011)

Acknowledgement

This work is partially supported by the National Aeronautics and Space Administration under grant NNX16AB22G, and by the National Science Foundation under grant ACI-1443069.

Author information

Authors and Affiliations

Carnegie Mellon University, Mountain View, USA
Jia Zhang & Maryam Pourreza
NASA Jet Propulsion Laboratory, Pasadena, USA
Seungwon Lee
NASA Ames Research Center, Mountain View, USA
Ramakrishna Nemani
Science Mission Directorate, NASA Headquarters, Washington, D.C., USA
Tsengdar J. Lee

Corresponding author

Correspondence to Jia Zhang.

Editor information

Editors and Affiliations

Free University of Bozen-Bolzano, Bolzano, Italy
Claus Pahl
IBM Research Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Maja Vukovic
Zhejiang University, Hangzhou, China
Jianwei Yin
Rochester Institute of Technology, Rochester, NY, USA
Qi Yu

About this paper

Cite this paper

Zhang, J., Pourreza, M., Lee, S., Nemani, R., Lee, T.J. (2018). Unit of Work Supporting Generative Scientific Workflow Recommendation. In: Pahl, C., Vukovic, M., Yin, J., Yu, Q. (eds) Service-Oriented Computing. ICSOC 2018. Lecture Notes in Computer Science, vol 11236. Springer, Cham. https://doi.org/10.1007/978-3-030-03596-9_32

DOI: https://doi.org/10.1007/978-3-030-03596-9_32
Published: 07 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03595-2
Online ISBN: 978-3-030-03596-9
eBook Packages: Computer Science (R0)