Reliability and cost optimization in distributed computing systems
Introduction
In the context of computer networking, distributed computing systems (DCSs) are preferable to centralized systems in several respects, such as information sharing, concurrent processing and operational capabilities [1], [2]. Yet, a DCS must be reliable to be successful. The system reliability of a DCS can be measured by the probability that its tasks, such as programs and software, execute successfully during the period of task execution. Achieving a reliable DCS thus comprises three parts: a reliable communication network, reliable task processing and a good task-allocation policy.
A distributed computing system consists of a set of processing nodes connected by communication links. A DCS is redundant if it possesses software redundancy (e.g. data replication among processing nodes) and/or hardware redundancy (e.g. multiple processors at a processing node, or multiple communication links connecting a pair of processing nodes). When system redundancy is unavailable or infeasible, the main approach to improving the system reliability of a DCS is to allocate tasks to the processing nodes so as to maximize system reliability [3], [4]. When system redundancy is available, on the other hand, system reliability can be gained by increasing software and/or hardware redundancy, along with deploying a task-allocation strategy. In the literature on software redundancy, data redundancy has been considered in distributed database systems, with algorithms proposed to seek the minimal data replication that retains high system reliability [5]. In [6], [7], file redundancy is taken into account and the reliability problem is formulated so as to reduce the total communication cost; algorithms based on Lagrangian relaxation and subgradient methods are then used to solve this problem. For distributed database systems with ring topology, replicated objects are optimally allocated to achieve high system reliability [8]. Since the reliability problem for a general DCS with software redundancy is highly complicated, reliability evaluation is usually computationally expensive; several efficient algorithms for DCSs of different topologies are therefore proposed in [9], [10], [11], [12], [13], [14].
Alternatively, system reliability can be improved by increasing hardware redundancy [15], [16], [17], [18], [19]. In [17], [18], task-allocation models under different hardware redundancy levels are presented to maximize system reliability. Because the computation grows exponentially, the problem is solved with a state-space search-tree algorithm. The study is further improved in [15], [16], where a branch-and-bound algorithm is proposed and the intra-communication of modules is considered in the task-allocation sequence, both of which significantly reduce the average computational effort of task allocation. Verma and Tamhankar [19] adopt the task-allocation models of [18] and propose a branch-and-bound algorithm for solving the multiple-join problem in distributed database systems. In these studies, however, the hardware redundancy level is treated as a constant, so the system reliability obtained applies mainly to one particular hardware redundancy level. How the system reliability of a DCS varies with the hardware redundancy level remains unknown and deserves investigation.
In this study, cycle-free DCSs with hardware redundancy are considered. Following the work in [18], this study also considers DCSs with homogeneous hardware redundancy, i.e. all processing nodes and communication links share the same hardware redundancy level. Intuitively, the higher the hardware redundancy level, the higher the system reliability. In practice, however, when a DCS is cost-sensitive or budget-limited, raising the hardware redundancy level without bound to gain system reliability is infeasible. A unified model of system cost is thus proposed in this paper that accounts for system reliability as well as the hardware redundancy level. The relationship between system cost and the hardware redundancy level is first derived for a given task assignment; based on this relationship, a hybrid heuristic combining genetic algorithms and the steepest descent method is then developed to seek the optimal task allocation and hardware redundancy policies that minimize system cost.
The remainder of this paper is organized as follows. Section 2 describes the reliability analysis for cycle-free hardware-redundant distributed computing systems. In Section 3, a unified model of system cost is presented and the non-linear behavior of system cost as the hardware redundancy level changes is quantified. In Section 4, a hybrid algorithm that combines genetic algorithms (for allocating a task) and the steepest descent method (for finding the optimal hardware redundancy level for a given task allocation) is developed to solve the system cost minimization problem. Experimental studies are conducted in Section 5, followed by a brief summary in Section 6.
Section snippets
Model of system reliability
A task, consisting of a set of modules, is to be executed on a distributed computing system, where m is the number of modules. A DCS comprises a set of processing nodes, where n is the number of processing nodes, each of which hosts one or more processors. The communication link connecting two processing nodes p and q has transmission rate wp,q; and the communication path between two nodes, not necessarily…
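The ingredients of this model can be sketched concretely. In the sketch below, the module names, node count and transmission rates are illustrative stand-ins (the excerpt gives no numerical values); only the structure — modules, nodes, rates wp,q, and a cycle-free topology with unique communication paths — follows the text:

```python
from collections import deque

modules = ["m1", "m2", "m3", "m4"]        # a task of m = 4 modules
nodes = [0, 1, 2]                         # n = 3 processing nodes
rate = {(0, 1): 9.6, (1, 2): 19.2}        # w[p,q]: link transmission rates

# adjacency list of the cycle-free (tree) topology
adj = {v: [] for v in nodes}
for (p, q) in rate:
    adj[p].append(q)
    adj[q].append(p)

def path(u, v):
    """In a cycle-free topology the communication path between two
    nodes is unique; breadth-first search recovers it."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            break
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)
    p, node = [], v
    while node is not None:
        p.append(node)
        node = parent[node]
    return p[::-1]
```

Uniqueness of paths is what keeps the reliability analysis of cycle-free DCSs tractable: the path between any two nodes is recovered by a single breadth-first search.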
System cost
As shown in Eq. (7), maximum reliability of a distributed computing system for a fixed redundancy level can be achieved by finding the optimal task assignment matrix. With additional endowment of hardware redundancy, the DCS becomes more reliable, hence reducing average execution cost more significantly in the long run. Yet, such endowment increases system cost (such as hardware deployment cost, maintenance cost, etc.). This trade-off between system cost and the hardware redundancy level is
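The trade-off can be illustrated with a toy cost model. The function below is a hypothetical stand-in, not the paper's model: it assumes k-fold parallel redundancy with per-unit reliability r, an expected failure penalty, and a hardware deployment cost linear in the redundancy level k:

```python
def system_cost(k, unit_hw_cost=2.0, failure_penalty=100.0, r=0.9):
    """Hypothetical system cost at redundancy level k: the expected
    failure penalty shrinks geometrically with k (parallel redundancy),
    while the hardware cost grows linearly in k."""
    unreliability = (1 - r) ** k          # probability all k replicas fail
    return failure_penalty * unreliability + unit_hw_cost * k

costs = [round(system_cost(k), 2) for k in range(1, 6)]
# cost falls while reliability gains dominate, then rises once hardware
# cost dominates: [12.0, 5.0, 6.1, 8.01, 10.0]
```

The fall-then-rise sequence exhibits exactly the unimodal shape discussed above: beyond some redundancy level, extra hardware costs more than the reliability it buys.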
Cost minimization
As system cost is either increasing or unimodal with respect to the hardware redundancy level for a given task assignment (see Appendix A), the optimal hardware redundancy level for a given task assignment can be determined uniquely using a local search method. Therefore, instead of solving (10) directly, we transform (10) into an equivalent form that, for each task assignment, first evaluates the minimal system cost attained at the corresponding optimal hardware redundancy level, and then minimizes this cost over task assignments.
To obtain
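A minimal sketch of such a local search, assuming the cost is strictly unimodal (or increasing) in the integer redundancy level — the property established in Appendix A — is an integer ternary search:

```python
def argmin_unimodal(f, lo, hi):
    """Locate the minimizer of a strictly unimodal (or increasing)
    function f on the integer interval [lo, hi]: ternary search narrows
    the interval, then a final linear scan finishes it off."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            hi = m2 - 1     # minimizer cannot lie right of m2
        else:
            lo = m1 + 1     # minimizer cannot lie left of m1
    return min(range(lo, hi + 1), key=f)
```

Each iteration discards roughly a third of the interval, so the optimal redundancy level is found in O(log k_max) cost evaluations rather than a full scan.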
Experimental results
In the simulation, three test problems are formed by allocating a task of four modules onto three cycle-free distributed computing systems of different sizes. Two scenarios are examined for each test problem:
- S1:
no more than one module can be assigned to a processing node during task execution; and
- S2:
no more than two modules can be assigned to a processing node during task execution.
Ten simulation runs are conducted for each scenario, and for each simulation run the minimum and maximum hardware
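The two scenarios amount to a per-node capacity constraint on the assignment. A sketch of the resulting feasible sets (the node count below is illustrative; the excerpt does not give the three systems' exact sizes):

```python
from itertools import product
from collections import Counter

def feasible_assignments(m, n, cap):
    """Enumerate module-to-node assignments in which no processing node
    receives more than `cap` modules (cap=1 for S1, cap=2 for S2)."""
    return [a for a in product(range(n), repeat=m)
            if max(Counter(a).values()) <= cap]

# four modules on a hypothetical 5-node system
s1 = feasible_assignments(4, 5, 1)   # S1: at most one module per node
s2 = feasible_assignments(4, 5, 2)   # S2: at most two modules per node
```

Note that S1 is feasible only when the number of nodes is at least the number of modules: four modules on a three-node system admit no cap-1 assignment at all.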
Summary
In this study, the optimal task allocation and hardware redundancy policies are examined for a cycle-free hardware-redundant distributed computing system. A model of system cost for a DCS is proposed which takes into account the sources of cost due to system reliability and hardware redundancy. It has been established that system cost is either increasing or unimodal with respect to the hardware redundancy level for a given task assignment. A hybrid algorithm which combines genetic algorithms
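Under stated assumptions, the hybrid algorithm summarized above can be sketched as follows. The cost function here is a hypothetical stand-in for the paper's model, shaped only so that, for a fixed assignment, cost is unimodal in the redundancy level; the genetic operators (truncation selection, one-point crossover, random mutation) are standard textbook choices, not necessarily the paper's:

```python
import random

def system_cost(assignment, k):
    """Hypothetical stand-in cost: a reliability-driven term that falls
    geometrically with redundancy level k plus a linear hardware cost."""
    load = sum(assignment) + 1
    return load * (0.5 ** k) * 100 + 3.0 * k

def best_redundancy(assignment, k_max=20):
    """Inner local search: cost is unimodal in k for a fixed assignment,
    so scan upward until the cost stops improving."""
    best_k, best_c = 1, system_cost(assignment, 1)
    for k in range(2, k_max + 1):
        c = system_cost(assignment, k)
        if c >= best_c:
            break
        best_k, best_c = k, c
    return best_k, best_c

def hybrid_search(m=4, n=3, pop=20, gens=30, seed=0):
    """Outer genetic algorithm over task assignments (module -> node);
    each candidate is scored at its own optimal redundancy level."""
    rng = random.Random(seed)
    population = [[rng.randrange(n) for _ in range(m)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=lambda a: best_redundancy(a)[1])
        parents = scored[: pop // 2]             # truncation selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, m)            # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:               # mutation
                child[rng.randrange(m)] = rng.randrange(n)
            children.append(child)
        population = parents + children
    best = min(population, key=lambda a: best_redundancy(a)[1])
    return best, best_redundancy(best)
```

The key design point mirrors the paper's transformation: the inner unimodal search collapses the redundancy dimension, so the outer genetic algorithm only has to explore the space of task assignments.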
Chung-Chi Hsieh is an associate professor in the Department of Industrial Management Science at National Cheng Kung University, Tainan, Taiwan. He completed his Ph.D. in Industrial Operations Engineering at the University of Michigan in 1997. His research interests are in computer-aided design, distributed computing systems and electronic commerce.
References (24)
- et al. The distributed program reliability analysis on star topologies. Computers and Operations Research (2000)
- et al. Reliability optimization in the design of distributed systems. IEEE Transactions on Reliability (1985)
- Distributed operating systems (1995)
- et al. A task allocation model for distributed computing systems. IEEE Transactions on Computers (1982)
- et al. A graph matching approach to optimal task assignment in distributed computing systems under a minmax criterion. IEEE Transactions on Computers (1985)
- Schloss GA, Stonebraker M. Highly redundant management of distributed data. In: Proceedings of Workshop on the...
- Chiu GM, Raghavendra CS. A model for optimal resource allocation in distributed computing systems. In: Proceedings of...
- Chiu GM, Raghavendra CS. A model for optimal database allocation in distributed computing systems. In: Proceedings of...
- et al. Optimal allocation for partially replicated database systems on ring networks. IEEE Transactions on Knowledge and Data Engineering (1994)
- Chang PY, Chen DJ. Optimal routing for distributed computing systems with data replication. In: Proceedings of the IEEE...
- Reliability issues with multiprocessor distributed database systems: a case study. IEEE Transactions on Reliability
Yi-Che Hsieh obtained his M.S. at National Cheng Kung University, Tainan, Taiwan. The areas of his research are in database management, distributed computing, and network design.