Reliability and cost optimization in distributed computing systems
Introduction
In the context of computer networking, distributed computing systems (DCSs) are preferable to centralized systems in several respects, such as information sharing, concurrent processing and operational capabilities [1], [2]. Yet, a DCS must be reliable to be successful. The system reliability of a DCS can be measured by the probability that its tasks, such as programs and software, execute successfully during the period of task execution. Achieving a reliable DCS thus comprises three parts: a reliable communication network, reliable task processing and a good task-allocation policy.
A distributed computing system consists of a set of processing nodes connected by communication links. A DCS is redundant if it possesses software redundancy (e.g. data replication among processing nodes) and/or hardware redundancy (e.g. multiple processors at a processing node, or multiple communication links connecting a pair of processing nodes). When system redundancy is unavailable or infeasible, the main approach to improving the system reliability of a DCS is to allocate tasks to the processing nodes so as to maximize system reliability [3], [4]. When system redundancy is available, on the other hand, system reliability can be gained by increasing software and/or hardware redundancy, along with deploying a task-allocation strategy. In the literature on software redundancy, data redundancy has been considered in distributed database systems, with algorithms proposed to seek the minimal data replication that retains high system reliability [5]. In [6], [7], file redundancy is taken into account and the reliability problem is formulated so as to reduce the total communication cost; algorithms based on Lagrangian relaxation and subgradient methods are then used to solve this problem. For distributed database systems with ring topology, replicated objects are optimally allocated to achieve high system reliability [8]. Since the reliability problem for a general DCS with software redundancy is highly complicated, reliability evaluation is usually computationally expensive; several efficient algorithms for DCSs of different topologies are therefore proposed in [9], [10], [11], [12], [13], [14].
Alternatively, system reliability can be improved by increasing hardware redundancy [15], [16], [17], [18], [19]. In [17], [18], task-allocation models under different hardware redundancy levels are presented to maximize system reliability. Because the computation grows exponentially, the problem is solved with a state-space search-tree algorithm. The study is further improved in [15], [16], where a branch-and-bound algorithm is proposed and the intra-communication of modules is considered in the task-allocation sequence, both of which significantly reduce the average computational effort of task allocation. Verma and Tamhankar [19] adopt the task-allocation models of [18] and propose a branch-and-bound algorithm for solving the multiple-join problem in distributed database systems. In these studies, however, the hardware redundancy level is treated as a constant, so the system reliability obtained applies mainly to one particular hardware redundancy level. How the system reliability of a DCS varies with the hardware redundancy level remains unknown and deserves investigation.
In this study, cycle-free DCSs with hardware redundancy are considered. Following the work in [18], this study also considers DCSs with homogeneous hardware redundancy, i.e. all processing nodes and communication links share the same hardware redundancy level. Intuitively, the higher the hardware redundancy level, the higher the system reliability. In practice, however, when a DCS is cost-sensitive or budget-limited, raising the hardware redundancy level without bound to gain system reliability is infeasible. A unified model of system cost is thus proposed in this paper that accounts for system reliability as well as the hardware redundancy level. The relationship between system cost and the hardware redundancy level is first derived for a given task assignment; based on this relationship, a hybrid heuristic combining genetic algorithms and the steepest descent method is then developed to seek the optimal task allocation and hardware redundancy policies that minimize system cost.
The remainder of this paper is organized as follows. Section 2 describes the reliability analysis for cycle-free hardware-redundant distributed computing systems. In Section 3, a unified model of system cost is presented and the non-linear behavior of system cost as the hardware redundancy level changes is quantified. In Section 4, a hybrid algorithm that combines genetic algorithms (for allocating a task) and the steepest descent method (for finding the optimal hardware redundancy level for a given task allocation) is developed to solve the system cost minimization problem. Experimental studies are conducted in Section 5, followed by a brief summary in Section 6.
Section snippets
Model of system reliability
A task, consisting of a set of modules, is to be executed on a distributed computing system, where m is the number of modules. A DCS comprises a set of processing nodes, where n is the number of processing nodes, each of which hosts one or more processors. The communication link connecting two processing nodes p and q has transmission rate wp,q; and the communication path between two nodes, not necessarily…
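The ingredients of this model can be sketched concretely. In the sketch below, the module names, node count and transmission rates are illustrative stand-ins (the excerpt gives no numerical values); only the structure — modules, nodes, rates wp,q, and a cycle-free topology with unique communication paths — follows the text:

```python
from collections import deque

modules = ["m1", "m2", "m3", "m4"]        # a task of m = 4 modules
nodes = [0, 1, 2]                         # n = 3 processing nodes
rate = {(0, 1): 9.6, (1, 2): 19.2}        # w[p,q]: link transmission rates

# adjacency list of the cycle-free (tree) topology
adj = {v: [] for v in nodes}
for (p, q) in rate:
    adj[p].append(q)
    adj[q].append(p)

def path(u, v):
    """In a cycle-free topology the communication path between two
    nodes is unique; breadth-first search recovers it."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            break
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)
    p, node = [], v
    while node is not None:
        p.append(node)
        node = parent[node]
    return p[::-1]
```

Uniqueness of paths is what keeps the reliability analysis of cycle-free DCSs tractable: the path between any two nodes is recovered by a single breadth-first search.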
System cost
As shown in Eq. (7), maximum reliability of a distributed computing system for a fixed redundancy level can be achieved by finding the optimal task assignment matrix. With additional endowment of hardware redundancy, the DCS becomes more reliable, hence reducing average execution cost more significantly in the long run. Yet, such endowment increases system cost (such as hardware deployment cost, maintenance cost, etc.). This trade-off between system cost and the hardware redundancy level is
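The trade-off can be illustrated with a toy cost model. The function below is a hypothetical stand-in, not the paper's model: it assumes k-fold parallel redundancy with per-unit reliability r, an expected failure penalty, and a hardware deployment cost linear in the redundancy level k:

```python
def system_cost(k, unit_hw_cost=2.0, failure_penalty=100.0, r=0.9):
    """Hypothetical system cost at redundancy level k: the expected
    failure penalty shrinks geometrically with k (parallel redundancy),
    while the hardware cost grows linearly in k."""
    unreliability = (1 - r) ** k          # probability all k replicas fail
    return failure_penalty * unreliability + unit_hw_cost * k

costs = [round(system_cost(k), 2) for k in range(1, 6)]
# cost falls while reliability gains dominate, then rises once hardware
# cost dominates: [12.0, 5.0, 6.1, 8.01, 10.0]
```

The fall-then-rise sequence exhibits exactly the unimodal shape discussed above: beyond some redundancy level, extra hardware costs more than the reliability it buys.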
Cost minimization
As system cost is either increasing or unimodal with respect to the hardware redundancy level for a given task assignment (see Appendix A), the optimal hardware redundancy level for a given task assignment can be determined uniquely using a local search method. Therefore, instead of solving (10) directly, we transform (10) into an equivalent form that, for each task assignment, first evaluates the minimal system cost attained at the corresponding optimal hardware redundancy level, and then minimizes this cost over task assignments.
To obtain
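A minimal sketch of such a local search, assuming the cost is strictly unimodal (or increasing) in the integer redundancy level — the property established in Appendix A — is an integer ternary search:

```python
def argmin_unimodal(f, lo, hi):
    """Locate the minimizer of a strictly unimodal (or increasing)
    function f on the integer interval [lo, hi]: ternary search narrows
    the interval, then a final linear scan finishes it off."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            hi = m2 - 1     # minimizer cannot lie right of m2
        else:
            lo = m1 + 1     # minimizer cannot lie left of m1
    return min(range(lo, hi + 1), key=f)
```

Each iteration discards roughly a third of the interval, so the optimal redundancy level is found in O(log k_max) cost evaluations rather than a full scan.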
Experimental results
In the simulation, three test problems are formed by allocating a task of four modules onto three cycle-free distributed computing systems of different sizes. Two scenarios are examined for each test problem:
- S1:
no more than one module can be assigned to a processing node during task execution; and
- S2:
no more than two modules can be assigned to a processing node during task execution.
Ten simulation runs are conducted for each scenario, and for each simulation run the minimum and maximum hardware
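The two scenarios amount to a per-node capacity constraint on the assignment. A sketch of the resulting feasible sets (the node count below is illustrative; the excerpt does not give the three systems' exact sizes):

```python
from itertools import product
from collections import Counter

def feasible_assignments(m, n, cap):
    """Enumerate module-to-node assignments in which no processing node
    receives more than `cap` modules (cap=1 for S1, cap=2 for S2)."""
    return [a for a in product(range(n), repeat=m)
            if max(Counter(a).values()) <= cap]

# four modules on a hypothetical 5-node system
s1 = feasible_assignments(4, 5, 1)   # S1: at most one module per node
s2 = feasible_assignments(4, 5, 2)   # S2: at most two modules per node
```

Note that S1 is feasible only when the number of nodes is at least the number of modules: four modules on a three-node system admit no cap-1 assignment at all.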
Summary
In this study, the optimal task allocation and hardware redundancy policies are examined for a cycle-free hardware-redundant distributed computing system. A model of system cost for a DCS is proposed which takes into account the sources of cost due to system reliability and hardware redundancy. It has been established that system cost is either increasing or unimodal with respect to the hardware redundancy level for a given task assignment. A hybrid algorithm which combines genetic algorithms
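Under stated assumptions, the hybrid algorithm summarized above can be sketched as follows. The cost function here is a hypothetical stand-in for the paper's model, shaped only so that, for a fixed assignment, cost is unimodal in the redundancy level; the genetic operators (truncation selection, one-point crossover, random mutation) are standard textbook choices, not necessarily the paper's:

```python
import random

def system_cost(assignment, k):
    """Hypothetical stand-in cost: a reliability-driven term that falls
    geometrically with redundancy level k plus a linear hardware cost."""
    load = sum(assignment) + 1
    return load * (0.5 ** k) * 100 + 3.0 * k

def best_redundancy(assignment, k_max=20):
    """Inner local search: cost is unimodal in k for a fixed assignment,
    so scan upward until the cost stops improving."""
    best_k, best_c = 1, system_cost(assignment, 1)
    for k in range(2, k_max + 1):
        c = system_cost(assignment, k)
        if c >= best_c:
            break
        best_k, best_c = k, c
    return best_k, best_c

def hybrid_search(m=4, n=3, pop=20, gens=30, seed=0):
    """Outer genetic algorithm over task assignments (module -> node);
    each candidate is scored at its own optimal redundancy level."""
    rng = random.Random(seed)
    population = [[rng.randrange(n) for _ in range(m)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=lambda a: best_redundancy(a)[1])
        parents = scored[: pop // 2]             # truncation selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, m)            # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:               # mutation
                child[rng.randrange(m)] = rng.randrange(n)
            children.append(child)
        population = parents + children
    best = min(population, key=lambda a: best_redundancy(a)[1])
    return best, best_redundancy(best)
```

The key design point mirrors the paper's transformation: the inner unimodal search collapses the redundancy dimension, so the outer genetic algorithm only has to explore the space of task assignments.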
Chung-Chi Hsieh is an associate professor in the Department of Industrial Management Science at National Cheng Kung University, Tainan, Taiwan. He completed his Ph.D. in Industrial Operations Engineering at the University of Michigan in 1997. His research interests are in computer-aided design, distributed computing systems and electronic commerce.
References (24)
- et al. The distributed program reliability analysis on star topologies. Computers and Operations Research (2000)
- et al. Reliability optimization in the design of distributed systems. IEEE Transactions on Reliability (1985)
- Distributed operating systems (1995)
- et al. A task allocation model for distributed computing systems. IEEE Transactions on Computers (1982)
- et al. A graph matching approach to optimal task assignment in distributed computing systems under a minmax criterion. IEEE Transactions on Computers (1985)
- Schloss GA, Stonebraker M. Highly redundant management of distributed data. In: Proceedings of Workshop on the...
- Chiu GM, Raghavendra CS. A model for optimal resource allocation in distributed computing systems. In: Proceedings of...
- Chiu GM, Raghavendra CS. A model for optimal database allocation in distributed computing systems. In: Proceedings of...
- et al. Optimal allocation for partially replicated database systems on ring networks. IEEE Transactions on Knowledge and Data Engineering (1994)
- Chang PY, Chen DJ. Optimal routing for distributed computing systems with data replication. In: Proceedings of the IEEE...
- Reliability issues with multiprocessor distributed database systems: a case study. IEEE Transactions on Reliability
Yi-Che Hsieh obtained his M.S. at National Cheng Kung University, Tainan, Taiwan. The areas of his research are in database management, distributed computing, and network design.