Ensuring reliability in B2B services: Fault tolerant inter-organizational workflows

Information Systems Frontiers

Abstract

In the age of Business-to-Business (B2B) collaboration, ensuring reliability of workflows underlying inter-organizational business processes is of significant importance. There are, however, quite a few challenges towards achieving seamless operation. Such challenges arise from heterogeneity in infrastructure and coordination mechanism at participant organizations, as well as time and cost associated with recovery from failure. Our research presents foundations for a reliable scheme for recovery from failure of workflow processes spanning through multiple business entities. First, a system model is adapted from the mobile computing literature that serves to establish the requirements to be enforced by each participating organization. In our model, we adopt the Maximal Sequence Path (MSP) approach from Yoo et al. (Lecture Notes in Artificial Intelligence 2132:222–236, 2001), as a means of decomposing workflows into mobile agent-driven processes that communicate via web services at each organization. This decomposition ensures defining logical points within the dynamics of a workflow instance for locating accurate and consistent states of the system for recovery in case of a failure. Then, a set of algorithms for various business scenarios are developed and presented as practical solutions. These algorithms are shown to create checkpoints such that the system is always in a globally consistent state. As such, these algorithms constitute a set of standards that can be incorporated in business process management suites that support reliable inter-organizational collaboration.

References

  • Acharya, A., & Badrinath, B. R. (1994). Checkpointing distributed applications on mobile computers. Paper presented at the Third International Conference on Parallel and Distributed Information Systems.

  • Alonso, G., Hagen, C., Agrawal, D., El Abbadi, A., & Mohan, C. (2000). Enhancing the fault tolerance of workflow management systems. IEEE Concurrency, 8(3), 74–81.

  • Anderson, M., & Allen, R. (1999). Workflow interoperability: Enabling E-commerce. from http://www.wfmc.org/standards/docs/IneropChallPublic.PDF

  • Badrinath, B. R., Acharya, A., & Imielinski, T. (1996). Designing distributed algorithms for mobile computing networks. Computer Communications, 19(4), 309–320.

  • Basu, A., & Blanning, R. W. (2000). A formal approach to workflow analysis. Information Systems Research, 11(1), 17–36.

  • Borg, A., Blau, W., Graetsch, W., Herrmann, F., & Oberle, W. (1989). Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7(1), 1–24.

  • Brambilla, M., Ceri, S., Comai, S., & Tziviskou, C. (2005). Exception handling in workflow-driven Web applications. Paper presented at the Proceedings of the 14th international conference on World Wide Web.

  • Bui, T., & Lee, J. (1999). An agent-based framework for building decision support systems. Decision Support Systems, 25(3), 225–237.

  • Cabrera, L. F., Copeland, G., Cox, W., Feingold, M., Freund, T., Johnson, J., et al. (2003). Web Services Coordination (WS-Coordination).

  • Cai, T., Gloor, P. A., & Nog, S. (1997). DartFlow: A workflow management system on the web using transportable agents. Unpublished manuscript.

  • Chafle, G., Dasgupta, K., Kumar, A., Mittal, S., & Srivastava, B. (2006). Adaptation in Web Service Composition and Execution. Paper presented at the Web Services, 2006. ICWS '06. International Conference on.

  • Chandy, K. M., & Lamport, L. (1985). Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1), 63–75.

  • Chen, M. Y., Accardi, A., Kiciman, E., Lloyd, J., Patterson, D., Fox, A., et al. (2004). Path-based failure and evolution management. Paper presented at the Symposium on Networked Systems Design and Implementation.

  • Chiu, D. K. W., Cheung, S. C., Till, S., Karlapalem, K., Li, Q., & Kafeza, E. (2004). Workflow view driven cross-organizational interoperability in a web service environment. Information Technology and Management, 5(3–4), 221.

  • Chrysanthis, P. K., Znati, T., Banerjee, S., & Shi-Kuo, C. (1999). Establishing virtual enterprises by means of mobile agents. Paper presented at the International Workshop on Research Issues on Data Engineering, Sydney.

  • Colombo, E., Francalanci, C., & Pernici, B. (2002). Modeling coordination and control in cross-organizational workflows. In Lecture Notes in Computer Science (Vol. 2519/2002, pp. 91–106). Springer Berlin: Heidelberg.

  • Cone, E. (2006). Boeing: New Jet, New Way of Doing Business [Electronic Version], from http://www.cioinsight.com/c/a/Case-Studies/Boeing-New-Jet-New-Way-of-Doing-Business/

  • Dialani, V., Miles, S., Moreau, L., Roure, D. D., & Luck, M. (2002). Transparent fault tolerance for web services based architectures. In Lecture Notes in Computer Science (Vol. 2400/2002, pp. 107–201). PaderBorn: Springer Berlin/Heidelberg.

  • Dini, P., Lombardo, G., Mansell, R., Razavi, A. R., Moschoyiannis, S. K., Krause, P. J., et al. (2008). Beyond interoperability to digital ecosystems: Regional innovation and socio-economic development led by SMEs. International Journal of Technological Learning, Innovation and Development, 1(3), 410–426.

  • Dobson, G. (2006). Using WS-BPEL to implement software fault tolerance for web services. Paper presented at the Software Engineering and Advanced Applications, 2006. SEAA’06. 32nd EUROMICRO Conference on.

  • Elnozahy, E. N., & Plank, J. S. (2004). Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing, 1(2), 97–108.

  • Feldman, J. (2010). Cloud contracts and SLAs.

  • Gartner. (2007). Gartner identifies the top 10 strategic technologies for 2008. Retrieved June, 2008, from http://www.gartner.com/it/page.jsp?id=530109

  • Goul, M., Satyavikas, K., & Demirkan, H. (2003). Towards web services standards for fault tolerance capabilities in inter-organizational workflow management systems. Paper presented at the International Conference on Web Services.

  • Kamath, M., & Ramamritham, K. (1998). Pragmatic issues in coordinated execution and failure handling of workflows in distributed workflow control architectures.

  • Khan, Z. A., Shahid, S., Ahmad, H. F., Ali, A., & Suguri, H. (2005). Decentralized architecture for fault tolerant multi agent system. Paper presented at the Autonomous Decentralized Systems, 2005. ISADS 2005. Proceedings.

  • Kock, N. (2006). System analysis & design fundamentals—A business process redesign approach. Sage Publications.

  • Kumar, A., & Zhao, J. L. (1999). Dynamic routing and operational controls in workflow management systems. Management Science, 45(2), 253–272.

  • Liu, C., Li, Q., & Zhao, X. (2008). Challenges and opportunities in collaborative business process management: Overview of recent advances and introduction to the special issue. Information Systems Frontiers, 3, 3.

  • Lyu, M. R., Chen, X., & Wong, T. Y. (2004). Design and evaluation of a fault-tolerant mobile-agent system. Intelligent Systems, 19(5), 32–38.

  • Mcafee, A., Dessain, V., & Sjoman, A. (2007). Zara: IT for fast fashion. Harvard Business Case, #9-604-081.

  • Merz, M., Liberman, B., Muller-Jones, K., & Lamersdorf, W. (1996). Interorganisational workflow management with mobile agents in COSM. Paper presented at the Practical Application of Agents and Multiagent Systems.

  • Murphy, A. L., & Picco, G. P. (2002). Reliable communication for highly mobile agents. Autonomous Agents and Multi-Agent Systems, 5(1), 81–100.

  • Narendra, N. C. (2004). Flexible support and management of adaptive workflow processes. Information Systems Frontiers, 6(3), 247.

  • Nichols, J., Demirkan, H., & Goul, M. (2006). Autonomic workflow execution in the grid. Systems, Man and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 36(3), 353–364.

  • Osman, T., Wagealla, W., & Bargiela, A. (2004). An approach to rollback recovery of collaborating mobile agents. Systems, Man and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 34(1), 48–57.

  • Patterson, D. (2002). A simple way to estimate cost of downtime. Paper presented at the System Administration Conference.

  • Priami, C. (1997). Qualitative and quantitative analysis of mobile systems. from http://citeseer.ist.psu.edu/110836.html

  • Quaglia, F. (1998). Checkpointing protocols in distributed systems with mobile hosts: A performance analysis.

  • Razavi, A. R., Moschoyiannis, S. K., & Krause, P. J. (2007). Concurrency control and recovery management for open e-Business transactions. Paper presented at the Concurrency Control and Recovery Management for Open e-Business Transactions.

  • Schulz, K. A., & Orlowska, M. E. (2004). Facilitating cross-organisational workflows with a workflow view approach. Data & Knowledge Engineering, 51(1), 109–147.

  • Schuster, H. (2005). Pros and cons of distributed workflow execution algorithms. In T. Harder & W. Lehner (Eds.), Data management in a connected world (Vol. 3551/2005). Berlin/Verlag: Springer.

  • Sen, S., Demirkan, H., & Goul, M. (2005). Towards a verifiable checkpointing scheme for agent-based interorganizational workflow system “Docking Station” standards. Paper presented at the 38th Annual Hawaii International Conference on System Sciences, Big Island, Hawaii.

  • Shrivastava, S. K., & Wheater, S. M. (1999). Workflow management systems. IEEE Concurrency, 7, 3.

  • Stohr, E. A., & Zhao, J. L. (2001). Workflow automation: Overview and research issues. Information Systems Frontiers, 3(3), 281.

  • Strom, R. E., & Yemini, S. (1985). Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3), 204–226.

  • Tagg, R. (2001). Workflow in different styles of virtual enterprise. Paper presented at the Workshop on Information technology for Virtual Enterprises, Queensland, Australia

  • Vallee, G., Engelmann, C., Tikotekar, A., Naughton, T., Charoenpornwattana, K., Leangsuksun, C., et al. (2008). A framework for proactive fault tolerance. Paper presented at the Availability, Reliability and Security, 2008. ARES 08. Third International Conference on.

  • van der Aalst, W. (1999). Interorganizational workflows: An approach based on message sequence charts and petri nets. Systems Analysis—Modelling—Simulation, 34(3), 335–367.

  • van der Aalst, W. (2000). Inheritance of interorganizational workflows: How to agree to disagree without loosing control. Unpublished manuscript, Boulder.

  • van der Aalst, W. (2001). The P2P approach to Interorganizational Workflows. Paper presented at the 13th International Conference on Advanced Information Systems Engineering (CAiSE’01).

  • Verginadis, Y., & Mentzas, G. (2008). Agents and workflow engines for inter-organizational workflows in e-government cases. Business Process Management Journal, 14(2), 188.

  • Yoo, J.-J., Lee, D., Suh, Y.-H., & Lee, D.-I. (2001). Scalable workflow system model based on mobile agents. Lecture Notes in Artificial Intelligence, 2132, 222–236.

  • Zhao, J. L. (2002). Interorganizational workflow and E-commerce applications. Hawaii International Conference on Systems Sciences (HICSS).

  • Zhao, J. L., & Cheng, H. K. (2005). Web services and process management: A union of convenience or a new area of research? Decision Support Systems, 40(1), 1–8.

  • Zhao, X., & Liu, C. (2006). Tracking over Collaborative Business Processes. Paper presented at the International Conference on Business Process Management.

Author information

Corresponding author

Correspondence to Haluk Demirkan.

Appendices

Appendix A: Skeletal proof

1.1 Background: System model rules for each node in a workflow

  1. If (state = ‘Receive’) then wait for a message. If it is a ‘new unit of work’ message, then send an acknowledgement to the source and switch to state = ‘Process’. If it is an acknowledgement message, checkpoint the processed unit of work together with the send and receive messages. (Note: If more than one message arrives simultaneously, they will be placed into the FIFO queue.)

  2. If (state = ‘Process’) then perform the specified task and, upon completion, switch to state = ‘Send’.

  3. If (state = ‘Send’) then send the processed unit of work message to the workflow engine or the next node, and then switch to state = ‘Receive’.

(Note: Idle nodes are presumed to be in the ‘Receive’ state)
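The three rules above can be read as a small state machine. The following Python sketch is one possible reading, not the authors' implementation: the names `Message`, `Node`, `run_msp` and the `"WFE"` label are hypothetical, the payload is modelled as a tuple of the node names that have worked on it, and each checkpoint stores only that payload (the send and receive records are omitted for brevity).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    kind: str            # "work" (new unit of work) or "ack"
    source: str          # sender name; "WFE" stands for the workflow engine
    payload: tuple = ()  # cumulative work carried along the MSP


@dataclass
class Node:
    name: str
    next_node: Optional[str]                        # None: last node of the MSP
    state: str = "Receive"                          # idle nodes are in 'Receive'
    inbox: deque = field(default_factory=deque)     # FIFO queue of messages
    outbox: list = field(default_factory=list)      # (destination, Message)
    checkpoints: list = field(default_factory=list)
    pending: tuple = ()

    def step(self):
        if self.state == "Receive":                 # rule 1
            if not self.inbox:
                return
            msg = self.inbox.popleft()
            if msg.kind == "work":
                self.outbox.append((msg.source, Message("ack", self.name)))
                self.pending = msg.payload
                self.state = "Process"
            elif msg.kind == "ack":
                self.checkpoints.append(self.pending)
        elif self.state == "Process":               # rule 2
            self.pending = self.pending + (self.name,)
            self.state = "Send"
        elif self.state == "Send":                  # rule 3
            dest = self.next_node or "WFE"
            self.outbox.append((dest, Message("work", self.name, self.pending)))
            self.state = "Receive"


def run_msp(names):
    """Drive one unit of work through a single MSP until nothing changes."""
    nodes = {n: Node(n, names[i + 1] if i + 1 < len(names) else None)
             for i, n in enumerate(names)}
    nodes[names[0]].inbox.append(Message("work", "WFE"))
    delivered = []                                  # payloads returned to the engine
    progress = True
    while progress:
        progress = False
        for node in nodes.values():
            before = (node.state, len(node.inbox), len(node.checkpoints))
            node.step()
            for dest, msg in node.outbox:
                if dest in nodes:
                    nodes[dest].inbox.append(msg)
                elif msg.kind == "work":            # engine receives and acknowledges
                    delivered.append(msg.payload)
                    node.inbox.append(Message("ack", "WFE"))
            node.outbox.clear()
            if before != (node.state, len(node.inbox), len(node.checkpoints)):
                progress = True
    return nodes, delivered


nodes, delivered = run_msp(["A", "B", "C"])
for node in nodes.values():
    print(node.name, node.checkpoints)
```

In this sketch a node writes its checkpoint only when the acknowledgement arrives, so each checkpoint already contains the work of every earlier node in the MSP.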

1.2 Step 1: Correctness of checkpoints

For an inter-organizational workflow to recover to a globally consistent state when using checkpoints, there must be a set of rules for each node to follow to create independent checkpoints. Those rules are given above. The intent is that when there is a failure, checkpoints can be assembled to enable reconstruction of a global state with integrity guarantees. When using checkpoints, some rework may need to be done when there is a failure, but the most important property of a recovery process is to guarantee a globally consistent state. As is true of most related proofs regarding failure and recovery algorithms, it is impossible to consider all possible failure scenarios in all possible workflow contexts, particularly those with combinations of failures. Therefore, we focus in this skeletal proof on what might be referred to as a normal workflow context, characterized as follows:

  1. The workflow is presumed to be complete and capable of being compiled into a tractable set of MSPs,

  2. A node cannot alter an MSP for a given workflow,

  3. It is assumed that a legal or correct agent at each node operates on a unit of work,

  4. It is assumed that the writing of a checkpoint proceeds without failure or the node will be deemed as failing, and

  5. It is assumed that the last non-failed node in an MSP with a failure will have an available checkpoint.

Now, suppose there is some arbitrary checkpoint approach, in which every node in a workflow follows its own state-transition rules to checkpoint its local state independently. Then it is not possible to state with certainty that a node (e.g., B) that follows another node (e.g., A) in the workflow creates its checkpoint after the preceding node has completed and checkpointed its work (e.g., B may checkpoint before A, since the checkpoint time at A is arbitrary). In this arbitrary context, the ‘depends on’ ordering between an earlier checkpoint at one node and a later checkpoint at another node is not guaranteed.

In the node rules stated above, there is not an arbitrary checkpoint—instead, the rules guarantee a recoverable checkpoint. This is because for a particular workflow instance, if a unit of work is checkpointed by some node A and passed on to a node B according to the flow designated in the MSP, then processed by B and checkpointed by B, then checkpoint a (amount of information logged at node A after task is completed at A and delivered to node B) is said to be included in checkpoint b (a + amount of information logged at node B after the task is completed at B and delivered to the next node). This is because the system model assumes that all completed work at a node, i.e., the payload, is cumulative—the work of a node prior in the MSP is passed on to subsequent nodes in the MSP.

This ‘depends on’ relationship (denoted ‘>’) is an important criterion of checkpoint-based recovery approaches; therefore, in the scenario above b > a. Following is a formal definition of the ‘depends on’ relationship (‘>’) between checkpoints at nodes:

A node’s checkpoint ckpt_a > ckpt_b if there exist messages m and m’ such that receive(m’) belongs to ckpt_a, send(m) belongs to ckpt_b, and send(m) → receive(m’), where → represents the “happened before” relationship.

This means that the ‘depends on’ relationship refers to the timing of the checkpoints. If a checkpoint depends on another checkpoint, then the latter was created at a point in time prior to the former. Simply put, since ckpt_a > ckpt_b, if ckpt_a is in the global checkpoint then ckpt_b must also be in the global checkpoint in order to maintain consistency of the global checkpoint.
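The ‘depends on’ test can be sketched in Python. This is a simplified illustration under stated assumptions: it covers only the direct case in which m and m’ are the same new-unit-of-work message (a send always happens before its own receipt), it ignores acknowledgement messages, and the names `Checkpoint`, `depends_on` and `required_checkpoints` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    node: str
    sent: frozenset = frozenset()      # ids of messages whose send is logged
    received: frozenset = frozenset()  # ids of messages whose receive is logged


def depends_on(a, b):
    """a > b: a logs the receipt of a message whose send is logged in b."""
    return a is not b and bool(a.received & b.sent)


def required_checkpoints(start, all_checkpoints):
    """Every checkpoint that must accompany `start` in a global checkpoint."""
    required, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for other in all_checkpoints:
            if other not in required and depends_on(current, other):
                required.add(other)
                frontier.append(other)
    return required


# Hypothetical chain A -> B -> C with messages m1 (A to B) and m2 (B to C).
at_a = Checkpoint("A", sent=frozenset({"m1"}), received=frozenset({"m0"}))
at_b = Checkpoint("B", sent=frozenset({"m2"}), received=frozenset({"m1"}))
at_c = Checkpoint("C", sent=frozenset({"m3"}), received=frozenset({"m2"}))
everything = [at_a, at_b, at_c]

print(depends_on(at_b, at_a))   # B's checkpoint depends on A's
print(depends_on(at_a, at_b))   # but not the other way round
print(sorted(c.node for c in required_checkpoints(at_c, everything)))
```

Following the dependencies transitively mirrors the statement that including a checkpoint in the global checkpoint forces the inclusion of every checkpoint it depends on.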

Following is a discussion and proof of this important property for our system model. First, let us define several variables. For a particular workflow instance, assume the following conditions hold:

  1. The entire workflow is composed of r MSPs. For any particular MSP j, 1 ≤ j ≤ r.

  2. The j-th MSP has n_j nodes. For any particular node k, 1 ≤ k ≤ n_j.

Lemma 1

If checkpoint P_jk represents the checkpoint recorded at the k-th node of the j-th MSP of the workflow instance, then it is not possible that P_jk depends on checkpoint P_j(k+1).

Lemma 1 Proof

Suppose checkpoint P_jk depends on checkpoint P_j(k+1).

Since these are consecutive nodes in the same MSP, in following the system model’s rules there must have been a transfer of messages (a new-unit-of-work message and an acknowledgement message) between them, and the checkpoints contain records of the send and receive events. We denote the new-unit-of-work message by ‘m’, where the contents of m are as discussed in the system model. Since P_jk depends on P_j(k+1):

  a) There is a record of the event “received m” in P_jk,

  b) There is a record of the event “sent m” in the checkpoint P_j(k+1), but not in P_jk, and

  c) The event “received m” would have to have taken place before the event “sent m”, which is impossible given the rule send(m) → receive(m’).

However, (a) and (b) together with the ‘depends on’ relationship mean that send(m) occurred after receive(m’), which violates the rule in (c). By contradiction, it is not possible that P_jk depends on checkpoint P_j(k+1).

Discussion

This proof shows that if we follow the system model rules and each node behaves according to these rules, it would be impossible for an incorrect checkpoint to be written (unless there is a failure). By incorrect, we mean that the checkpoint would contain information from future work that may or may not be completed. This type of anomaly would result in an inconsistent state if we used the checkpoint information as a basis for restarting computation. Further, the system rules can now be unequivocally stated as a requirement for a web services-based docking station or WFE client.
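The notion of an incorrect checkpoint can be made concrete with a short Python check. This is an illustrative sketch only: it assumes the payload is recorded as the sequence of node names that have contributed work, and the function name `contains_future_work` is hypothetical.

```python
def contains_future_work(msp, node, payload):
    """True if `payload` logged at `node` includes work from a later node in `msp`."""
    position = msp.index(node)
    return any(msp.index(worker) > position for worker in payload)


msp = ["A", "B", "C"]
print(contains_future_work(msp, "B", ("A", "B")))        # correct checkpoint at B
print(contains_future_work(msp, "B", ("A", "B", "C")))   # would include work from C
```

A checkpoint for which this check returns True is the kind of anomaly the system model rules are meant to exclude.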

1.3 Step 2: Proof of a globally consistent state

Next, we must show that a set of correct checkpoints can be used to construct a globally consistent state. Let us define ‘local’ and ‘global’ checkpoints more precisely. A local checkpoint, P_jk, is a log written at a node that follows the state-transition rules, such that a log of the processed unit of work, along with the send and receive messages at that node, is completely written after a message containing that payload is forwarded to the next node in the MSP. Suppose further that P_jk includes the next node’s acknowledgement of receipt of the message sent by the current node. Then a global checkpoint, P_JK, is the set of all possible local checkpoints relevant to a particular workflow instance being coordinated by a central workflow engine at some point in time.

It follows that the total number of checkpoints for an entire workflow instance is less than or equal to P_JK, where:

$$ P_{JK} = \sum_{j=1}^{r} \sum_{k=1}^{n_j} P_{jk}. $$
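As a worked instance of this bound, suppose purely for illustration that a workflow has r = 2 MSPs with n_1 = 3 and n_2 = 2 nodes, and that each local checkpoint P_jk is counted as one entry. These numbers are assumptions, not values from the article. Then:

$$ \sum_{j=1}^{2} \sum_{k=1}^{n_j} 1 = n_1 + n_2 = 3 + 2 = 5, $$

so at most five local checkpoints can exist for that workflow instance, and fewer exist while some node is still waiting for its acknowledgement.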

We must show that:

Lemma 2

The collection of correct local checkpoints, P_JK, forms a consistent global checkpoint.

Proof

If, by contradiction, we do not have global consistency, then assume a message m has been sent from a node A to another node B in the same MSP for the same workflow instance such that:

  1. The event ‘send m’ is not in the global checkpoint; in other words, A sends m before writing a checkpoint, and then A continues work.

  2. The event ‘receive m’ is included in the global checkpoint (‘receive m’ is in the log at node B) after node B gets done with its work, at a point in time before node A gets done with its processing and checkpoints.

From 1 and 2 and the algorithm given, this implies that the checkpoint where ‘receive m’ is in the log (node B) < the checkpoint with ‘send m’ in the log (node A). The symbol ‘<’ is used to imply that the checkpoint at node B contains less than the total amount of work, and node B’s checkpoint comes before node A’s. (Conclusion I)

Fig. 14 Lemma 2

However, since there is a checkpoint in a node (in the above case, node B) where ‘receive m’ is in the log, there would have to be a message in the checkpoint of the prior node (in the above case, A) containing both a send and that receive, because the state-transition rule states that a node cannot send m until the work has been completely processed, and m can be received only after it has been sent. The ‘unit of work’ is completed sequentially by the nodes in an MSP for a particular workflow instance, and the cumulative message moves from one node to the next as stated in the system model and the state-transition rule. Hence the checkpoint at node A < the checkpoint at node B. (Conclusion II)

Since conclusions I and II contradict each other, it is shown that the state-transition rule results in global consistency of the generated logs (i.e., P_JK). The state-transition rule then forms the foundation of reliable web services-based IOW standards. Based on this rule, several variations can now be developed to accommodate different workflow situations and types of inter-organizational relationships, and to provide a foundation for recovery approach evaluations. Further, higher-level web services-based standards embedded in inter-organizational workflow docking stations can recover to the latest feasible point of completion by relying on globally consistent logs, provided that they follow the rules discussed here, write logs according to the algorithms stated, and rely on engines capable of both decomposing workflows into MSPs and initiating recovery procedures in cases of node failure.
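A consistency check and a recovery-point lookup in the spirit of this argument can be sketched in Python. The sketch rests on assumptions not stated in the article: a global checkpoint is modelled as a dictionary from a node name to its logged `sent` and `received` message ids, the workflow engine's own log appears under the hypothetical key `"WFE"`, and the function names `find_orphans` and `recovery_point` are illustrative.

```python
def find_orphans(global_checkpoint):
    """Message ids whose receive is logged somewhere but whose send is logged nowhere."""
    sent = set().union(*(c["sent"] for c in global_checkpoint.values()))
    received = set().union(*(c["received"] for c in global_checkpoint.values()))
    return received - sent


def recovery_point(msp, checkpointed, failed):
    """Last node before `failed` in `msp` that has a checkpoint, or None."""
    for node in reversed(msp[:msp.index(failed)]):
        if node in checkpointed:
            return node
    return None  # no usable checkpoint: restart the MSP from the engine


# Consistent: every logged receive has a matching logged send.
consistent = {
    "WFE": {"sent": {"m0"}, "received": set()},
    "A": {"sent": {"m1"}, "received": {"m0"}},
    "B": {"sent": {"m2"}, "received": {"m1"}},
}

# Inconsistent: B logs 'receive m1' but A's checkpoint with 'send m1' is missing.
inconsistent = {
    "WFE": {"sent": {"m0"}, "received": set()},
    "B": {"sent": {"m2"}, "received": {"m1"}},
}

print(find_orphans(consistent))
print(find_orphans(inconsistent))
print(recovery_point(["A", "B", "C", "D"], {"A", "B"}, "D"))
```

A message that has been sent but not yet received (m2 in the first example) is treated as in transit and does not make the global checkpoint inconsistent; only a receive without a logged send does.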

Appendix B: Abbreviations used

ACID: Atomicity, Consistency, Isolation, Durability

B2B: Business to Business

BPM: Business Process Management

CWFE: Centralized Workflow Engine

FIFO: First in First Out

MSP: Maximal Sequence Path

IOWS: Inter Organizational Workflow Systems

SOA: Service Oriented Architecture

WFE: Workflow Engine

WFI: Workflow Instance

About this article

Cite this article

Demirkan, H., Sen, S., Goul, M. et al. Ensuring reliability in B2B services: Fault tolerant inter-organizational workflows. Inf Syst Front 14, 765–788 (2012). https://doi.org/10.1007/s10796-011-9301-5

  • DOI: https://doi.org/10.1007/s10796-011-9301-5
