Energy-efficient fault-tolerant replica management policy with deadline and budget constraints in edge-cloud environment

https://doi.org/10.1016/j.jnca.2019.04.018

Abstract

With the development of large-scale distributed systems such as grids and clouds, data replication management has become a hot research topic. Although replica management can improve cluster performance, it also introduces a series of management and overhead issues. Therefore, an energy-efficient fault-tolerant replica management policy with deadline and budget constraints in the edge-cloud environment is proposed. Experiments show that the proposed dynamic replica placement algorithm effectively reduces the mean job time, reduces network bandwidth usage and improves storage space utilization. To address energy efficiency, an energy-aware cluster scaling strategy is proposed that reduces system energy consumption by sleeping and waking data nodes according to the load state of the system. In addition, to avoid the access failures and data loss caused by node failures, a node failure recovery method based on availability metrics is used to handle node failures. Experiments show that the proposed algorithm outperforms the compared algorithms in terms of energy efficiency and fault tolerance.

Introduction

With the development of large-scale distributed systems such as grids and clouds, data replication management has become a hot research topic. Edge computing (Shi et al., 2016) provides services on any computing resource along the path between cloud data centers and end users. The convergence of edge and cloud computing combines the strengths of both: clouds with virtually unlimited shared storage and computing resources, and edge computing with low-latency data preprocessing. Replica management technology (Yang and Hu, 2018) not only increases data reliability but also improves data access performance and maintains load balancing across the whole system. However, although replica placement can improve cluster performance, it also introduces a series of management and overhead issues. Given the high operating costs of their large-scale clusters and storage, most companies now pay close attention to energy efficiency. Server utilization is between 10% and 50% most of the time, while an idle server consumes more than 60% of the energy it consumes when busy (Barroso and Hölzle, 2007). Reducing the overall energy cost of a server cluster has therefore become a top priority. To this end, an energy-aware cluster scaling strategy based on dynamic replica placement is proposed in this paper: data nodes with low utilization in the edge-cloud system are put to sleep or shut down to reduce system energy consumption.
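As a rough illustration of this scaling idea, the sketch below sleeps the least-utilized data node when mean cluster load falls below a low threshold and wakes a sleeping node when load rises above a high one. The class, field names and threshold values are assumptions made for illustration, not the implementation used in the paper.

```python
from dataclasses import dataclass

@dataclass
class DataNode:
    name: str
    utilization: float   # fraction of capacity in use, 0.0-1.0
    asleep: bool = False

SCALE_DOWN = 0.2   # hypothetical low-load threshold
SCALE_UP = 0.8     # hypothetical high-load threshold

def scale_cluster(nodes: list[DataNode]) -> None:
    """Sleep lightly loaded nodes; wake sleeping nodes under high load."""
    active = [n for n in nodes if not n.asleep]
    mean_load = sum(n.utilization for n in active) / max(len(active), 1)
    if mean_load < SCALE_DOWN and len(active) > 1:
        # Sleep the least-utilized active node to save energy.
        idle = min(active, key=lambda n: n.utilization)
        idle.asleep = True
    elif mean_load > SCALE_UP:
        # Wake one sleeping node to absorb the extra load.
        for n in nodes:
            if n.asleep:
                n.asleep = False
                break
```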

In the edge-cloud environment (Aujla et al., 2018), users usually focus on only some of the data blocks in a file, not all of them. To improve performance, a file is divided into data blocks that are stored on different data nodes, so that parallel transmission increases the aggregate data bandwidth. As the node scale expands, node and hardware failures become the norm. Hardware and software faults, power outages and network failures occur frequently, leaving data blocks inaccessible. Because the edge-cloud system is dynamic, distributed and heterogeneous, a single data block access failure may render the entire file unavailable. To avoid the access failures and data loss caused by node failures, data replicas are widely used: placing replicas on different nodes ensures data reliability, data availability and network bandwidth utilization. Replication guarantees that if one data node fails, the data is still available and service is not interrupted. Therefore, to ensure data reliability and availability, a node failure recovery method based on availability metrics is proposed in this paper to handle node failures in HDFS clusters.
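The role of replication in availability can be made concrete. Assuming independent node failures, a block with r replicas on nodes that are each available with probability p is unavailable only if all r replicas are, and a file of b blocks is available only if every block is:

```python
def block_availability(p: float, r: int) -> float:
    """A block with r replicas fails only if all r replicas fail."""
    return 1.0 - (1.0 - p) ** r

def file_availability(p: float, r: int, b: int) -> float:
    """A file of b blocks is available only if every block is available."""
    return block_availability(p, r) ** b

# Example: with node availability 0.95, 3 replicas per block and 100 blocks,
# file availability stays above 0.987; with a single replica it collapses
# to 0.95 ** 100, i.e. below 0.006.
print(file_availability(0.95, 3, 100))
```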

Although current research on replica management considers the number of replicas, their allocation and the appropriate time to update them, it rarely considers energy-efficient and fault-tolerant replica placement together. Energy-efficient replica placement methods typically fall into three groups: methods based on node scheduling (Enokido et al., 2018; Yang et al., 2017; Yu et al., 2014), methods based on static data placement (Cho et al., 2014; Li et al., 2019a) and methods based on dynamic data placement (Rush and Altiparmak, 2016; Lin and Shen, 2017; Duolikun et al., 2016; Zhao and Wang, 2013; Li et al., 2019b). Most existing research on fault-tolerant replica placement (Eischer and Distler, 2018; Guerrero et al., 2018; Shao et al., 2019; Li et al., 2015, 2019c; Higai et al., 2014; Guo et al., 2014) focuses on improving the reliability of storage systems; it rarely targets edge-cloud environments, and little of it addresses failure recovery. Replica placement affects the fault tolerance of the system in two ways. First, when nodes fail, if the new node can access more recent redundant data, the repair time is greatly reduced. Second, the location of the redundant data strongly affects the fault tolerance of the system: especially when multiple nodes fail at the same time, well-chosen redundant data placement preserves the system's fault tolerance.

To solve the above problems, an energy-efficient fault-tolerant replica management policy with deadline and budget constraints in the edge-cloud environment is proposed. The output of the dynamic replica creation module (i.e., the number of replicas to create), together with environment information, serves as the input of the replica placement module. Taking the budget and deadline constraints into account, the proposed replica placement method builds a multi-objective optimization function that includes a file availability function, a node performance function and other terms. To address energy efficiency, an energy-aware cluster scaling strategy based on replica placement is proposed, which reduces system energy consumption by sleeping and waking data nodes according to the system load state. In addition, to avoid the access failures and data loss caused by node failures, a node failure recovery method based on availability metrics is used to handle node failures. The main contributions are summarized as follows.
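As a minimal sketch of how such a constrained multi-objective function might be evaluated, the code below scores candidate nodes by a weighted sum of availability and performance and filters out candidates that violate the budget or deadline. The attribute names, weights and constraint checks are assumptions for illustration; the paper's actual objective includes further terms.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    node: str
    availability: float  # estimated file availability if the replica is placed here
    performance: float   # normalized node performance score, 0.0-1.0
    cost: float          # monetary cost of placing the replica on this node
    latency: float       # expected access latency in seconds

def best_placement(cands, budget, deadline, w_avail=0.5, w_perf=0.5):
    """Pick the feasible candidate that maximizes the weighted objective."""
    feasible = [c for c in cands if c.cost <= budget and c.latency <= deadline]
    if not feasible:
        return None  # no placement satisfies both constraints
    return max(feasible,
               key=lambda c: w_avail * c.availability + w_perf * c.performance)
```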

  • (1)

    An energy-efficient replica management policy is proposed by establishing a multi-objective optimization model that takes the budget and deadline constraints into account. In addition, an energy-aware cluster scaling strategy is proposed that reduces system energy consumption by sleeping and waking data nodes according to the load state of the system.

  • (2)

    A node failure recovery method based on an availability metric is proposed to avoid the access failures and data loss caused by node failures. When a DataNode fails, the data block availability metric is used to check whether the failure renders a data block, and hence its file, unavailable. If a data block is unavailable, the copy backup method is used to restore it (see the sketch after this list).

  • (3)

    Experimental results show that the proposed algorithm effectively reduces response time, reduces network bandwidth usage and improves storage utilization. In addition, experiments show that the proposed algorithm outperforms the compared algorithms in terms of energy efficiency and fault tolerance.
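To make contribution (2) concrete, the following sketch shows how an availability check after a DataNode failure could separate blocks that merely need re-replication from blocks whose file would become unavailable and must be restored from the copy backup. The data structure and the target replica count are illustrative assumptions, not HDFS internals.

```python
from dataclasses import dataclass

@dataclass
class BlockInfo:
    block_id: str
    replica_nodes: set[str]  # nodes currently holding a replica of this block

def recover_after_failure(blocks: list[BlockInfo], failed_node: str, target: int = 3):
    """Classify blocks after failed_node dies: re-replicate or restore from backup."""
    re_replicate, restore_from_backup = [], []
    for blk in blocks:
        blk.replica_nodes.discard(failed_node)
        if not blk.replica_nodes:
            # No replica survives: the availability metric flags this block,
            # since the file containing it is now unavailable.
            restore_from_backup.append(blk.block_id)
        elif len(blk.replica_nodes) < target:
            # Still available, but below the target replica count: copy the
            # block from a surviving replica to a healthy node.
            re_replicate.append(blk.block_id)
    return re_replicate, restore_from_backup
```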

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the model and algorithms of the energy-efficient fault-tolerant replica management policy. Section 4 analyzes the experimental results. Section 5 concludes the paper.

Section snippets

Edge-cloud architecture

As a fast-growing technology, cloud computing has completely changed the products and development models of traditional IT enterprises. As application scenarios multiply, the shortcomings of the traditional centralized and distributed cloud architectures are gradually being exposed. The centralized cloud suffers from highly concentrated network load and poor robustness (Jemaa et al., 2017). The distributed cloud suffers from high system complexity and …

Energy-efficient fault-tolerant replica management with deadline and budget

As shown in Fig. 1, the edge-cloud architecture consists of the central cloud, the edge clouds and the terminals. The central cloud manages and monitors the resources of each edge cloud, which consists of edge nodes (ENs) distributed across different networks and regions. The edge cloud processes user requests and provides cloud services to users. The terminals are used by users to request and obtain cloud services. In the edge cloud, based on the number of replicas obtained in the central cloud, the …
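A minimal object sketch of this three-tier structure, with class and field names chosen for illustration only, might look as follows:

```python
from dataclasses import dataclass, field

@dataclass
class EdgeNode:
    name: str
    region: str  # ENs are distributed across different networks and regions

@dataclass
class EdgeCloud:
    nodes: list[EdgeNode] = field(default_factory=list)

    def serve(self, request: str) -> str:
        # Edge clouds process user requests close to the terminals.
        return f"request '{request}' served by {len(self.nodes)} edge nodes"

@dataclass
class CentralCloud:
    edge_clouds: list[EdgeCloud] = field(default_factory=list)

    def monitor(self) -> dict[int, int]:
        # The central cloud manages and monitors each edge cloud's resources.
        return {i: len(ec.nodes) for i, ec in enumerate(self.edge_clouds)}
```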

Experimental environment

  • (1)

    Experimental Setup

The experiments run on the Ubuntu 14.04 LTS operating system, with Hadoop 2.7.1 installed using the default settings. We use a 64-bit Linux machine running Eclipse 4.5 with JDK 1.7, and the experimental database is MySQL 5.0. The edge-cloud environment includes a central cloud and an edge cloud, where the edge cloud consists of a NameNode and a Hadoop cluster of 10 different DataNodes. The edge and central clouds are connected via VPN. The …

Conclusion

In this paper, an energy-efficient fault-tolerant replica management policy with deadline and budget constraints in the edge-cloud environment is proposed. Experiments show that the proposed dynamic replica creation algorithm can respond to burst-access data blocks by creating multiple replicas and dynamically adjusting the number of replicas according to the access frequency of the data blocks. Taking the budget and deadline constraints into account, the proposed replica placement …
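As a rough sketch of the dynamic replica creation idea, the function below raises a block's replica count with its recent access frequency and caps it to bound storage overhead; the thresholds and bounds are assumptions for illustration, not the paper's parameters.

```python
def replica_count(accesses_per_min: float, base: int = 3, cap: int = 8) -> int:
    """More replicas for burst-access blocks, never fewer than the base count."""
    extra = int(accesses_per_min // 50)  # hypothetical: one extra replica per 50 req/min
    return min(base + extra, cap)

# A burst of 220 requests/min raises the count from 3 to 7 replicas;
# when the burst subsides, the count falls back toward the base.
print(replica_count(220))  # -> 7
```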

Acknowledgment

The work was supported by the National Natural Science Foundation of China under grants No. 61672397, No. 61873341 and No. 61771354; the Application Foundation Frontier Project of Wuhan (No. 2018010401011290); the Fundamental Research Funds for the Central Universities (No. 2018-YS-063); the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources (Grant No. KF-2018-03-005); and the Open Research Fund of the Beijing Key Laboratory of Big Data Technology for Food …

Chunlin Li is a Professor of Computer Science at Wuhan University of Technology. She received her ME in Computer Science from Wuhan Transportation University in 2000 and her PhD in Computer Software and Theory from Huazhong University of Science and Technology in 2003. Her research interests include cloud computing and distributed computing.

References (33)

  • M. Eischer et al., Latency-aware leader selection for geo-replicated Byzantine fault-tolerant systems.

  • T. Enokido et al., An energy-efficient process replication algorithm based on the active time of cores.

  • C. Guerrero et al., Migration-aware genetic optimization for MapReduce scheduling and replica placement in Hadoop, J. Grid Comput. (2018).

  • W. Guo et al., A workload-based dynamic adaptive data replica placement method.

  • A. Higai et al., A study of replica reconstruction schemes for multi-rack HDFS clusters.

  • F.B. Jemaa et al., QoS-aware VNF placement optimization in edge-central carrier cloud architecture.



Yaping Wang received her BS degree in medical information engineering from Hubei University of Traditional Chinese Medicine in 2017. She is now an MS student at Wuhan University of Technology. Her research interests include cloud computing and big data.

Yi Chen received her PhD degree from Beijing Institute of Technology. She is now a Professor in the School of Computer and Information Engineering at Beijing Technology and Business University. Her research interests include information visualization and visual analytics, intelligent data processing, big data technology for food safety, data mining and machine learning. She has published more than 90 papers.

Youlong Luo is an Associate Professor of Management at Wuhan University of Technology. He received his MS in Telecommunication and System from Wuhan University of Technology in 2003 and his PhD in Finance from Wuhan University of Technology in 2012. His research interests include cloud computing and electronic commerce.
