Fast algorithms for restoring survivable spanning connection

https://doi.org/10.1016/j.compeleceng.2021.107643

Highlights

  • A problem is proposed to restore survivable spanning connection with link failures.

  • Survivable spanning connection is restored by updating its faulty spanning trees.

  • Two algorithms are proposed for shared and unshared link failures, respectively.

  • The proposed algorithms can significantly reduce the recovery time.

  • The proposed algorithms can achieve nearly optimal network survivability.

Abstract

The stability of data transmission in a network can be maintained by a survivable spanning connection (SSC), which consists of two not-fully-disjoint spanning trees. Existing works restore a faulty SSC after link failures by regenerating two new not-fully-disjoint spanning trees. However, regeneration takes considerable recovery time, especially in a large network. This paper aims to quickly restore the faulty SSC by adjusting and updating its faulty spanning trees. An SSC restoration problem is formulated to generate a restored SSC with nearly optimal network survivability that meets the desired bandwidth requirement. To solve the formulated problem, two fast restoration algorithms are proposed to cope with the shared link failure and the unshared link failure, respectively. Simulation results show that the proposed algorithms reduce the recovery time by 36.69% and 55.54%, respectively, compared with the existing algorithm. In addition, the proposed algorithms achieve nearly optimal network survivability.

Introduction

The transmission capacity of Ethernet has reached 100 Gb/s and beyond, and it is expected to reach almost 400 Gb/s in the near future [1]. With the rapid improvement of transmission capacity, an increasing amount of data can be transmitted in the network. However, a vast amount of data will be lost when failures occur in the network. This results in severe disruption to business services and poor quality of service (QoS) for users [2]. Therefore, failures must be recovered promptly once they occur in the network.

Link failure is one of the common failures in the network [3]. Each link has a failure probability, caused by natural disasters, intentional attacks and so on [4], [5]. Network survivability is the ability of the network to continue transmitting data under the given requirements in the presence of failures and other undesired events [6], [7]. Thus, existing works [8], [9] have widely investigated survivability-based recovery schemes to cope with link failures. Meanwhile, most existing works [10], [11] have studied the single link failure model, which is designed to handle single failure events, for two reasons. First, addressing single link failures is a common requirement in various survivability standards [12], [13]. Second, analyzing the single link failure model can provide insights for handling multiple link failures.

Recovery schemes for single link failures fall into two categories: restoration schemes [14], [15], [16] and protection schemes [17], [18], [19]. In a restoration scheme, post-failure actions are performed to recover data transmission from link failures. Authors in [14] investigated the problem of provisioning QoS paths under bottleneck QoS and additive QoS constraints. They also proposed an approximation algorithm to obtain a primary QoS path and to construct a restoration topology that protects the primary path from link failures. Authors in [16] proposed an algorithm based on ant colony and adjacent shortest cycle to find an alternative path for data retransmission in the optical network. In a protection scheme, pre-failure actions are performed to establish backup solutions for link failures in advance. Authors in [17] presented a coding-aware virtual network mapping framework to solve the survivable virtual network mapping problem. The framework consists of two coding mechanisms based on different link mapping algorithms. Both mechanisms use fewer redundant resources to achieve instantaneous service recovery. Authors in [18] proposed an efficient polynomial-time approximation algorithm to optimize network survivability under additive end-to-end QoS constraints. Moreover, a protection scheme can deal with link failures faster than a restoration scheme, because it derives the backup solutions in advance. However, the algorithms proposed in the aforementioned works are designed to recover unicast transmission between a pair of nodes from a faulty link, and they are not suitable for recovering broadcast transmission from link failures.

It is well known that a spanning tree can achieve data transmission between any two nodes in the network at a low communication overhead. Several spanning tree protocols have been proposed to maintain the stability of broadcast transmission. The first is the spanning tree protocol [20], in which only one spanning tree is built to protect the network against failures. However, this protocol takes a lengthy 30 s to 60 s to recover data transmission from failures [21]. In order to reduce data loss, the recovery time should be no more than 50 ms [12], [13]. Thus, the spanning tree protocol cannot meet the requirement on recovery time. Accordingly, the fast spanning tree (FST) protocol [22] has been introduced to reduce the recovery time. Authors in [23] proposed an FST reconnection algorithm to recover data transmission in the metro Ethernet network. The reconnection algorithm uses a pre-configured link to reconnect the broken spanning tree. Authors in [24] proposed a heuristic algorithm to solve the partial spatial protection problem. The heuristic algorithm can provide differentiated reliability at the spanning tree level. In addition, some existing works [25], [26], [27] exploited the multiple spanning tree (MST) protocol [28] to further shorten the recovery period. In [25], authors designed an MST architecture to improve the failure resiliency and throughput for metropolitan areas and cluster networks. In [26], authors proposed a distributed and fast MST-based restoration mechanism to recover from link failures in the metro Ethernet network.

MST-based recovery schemes can achieve reliable data transmission in the network by establishing multiple fully-disjoint spanning trees (MFSTs) to transmit data. In MFSTs, there is no shared link. Although employing MFSTs can offer full network survivability, the fully-disjoint requirement is too strict. Specifically, MFSTs with the desired QoS guarantee may not exist, in which case employing MFSTs becomes infeasible. Therefore, the fully-disjoint requirement must be relaxed. The concept of tunable survivability was introduced in [29]; it offers a quantitative measure for specifying the desired degree of network survivability. This implies that network survivability can be specified anywhere in the range of 0% to 100%. Authors in [30] combined tunable survivability with multiple not-fully-disjoint spanning trees (MNSTs) to maintain the stability of broadcast transmission under the desired network survivability. They defined network survivability in terms of the failure probability of the links shared between MNSTs. They also claimed that adding more MNSTs can increase network survivability as long as the number of spanning trees stays below a certain number. However, transmitting data through more MNSTs imposes a high management overhead on the network. Therefore, it is important to strike a trade-off among the number of not-fully-disjoint spanning trees, network survivability and management overhead. To this end, the work [30] verified that two not-fully-disjoint spanning trees can achieve optimal network survivability with less management overhead. The combination of these two not-fully-disjoint spanning trees is defined as a survivable spanning connection (SSC). In an SSC, one spanning tree is used to transmit data, while the other spanning tree serves as a backup. When the faulty link is an unshared link, the backup spanning tree is activated to continue transmitting data, and the broken spanning tree is discarded. When the faulty link is a shared link, both spanning trees in the SSC break down, and thus data transmission is interrupted. In both cases, authors in [30] formed a new SSC by regenerating two new not-fully-disjoint spanning trees. However, this regeneration takes considerable recovery time, especially in a large network.
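
The two failure cases described above can be illustrated with a minimal sketch. It assumes each spanning tree of an SSC is stored as a set of undirected links; the names `link`, `classify_failure` and the four-node example are hypothetical and are not taken from [30].

```python
from typing import FrozenSet, Set

# Assumption: an undirected link is stored as an unordered pair of node names.
Link = FrozenSet[str]


def link(m: str, n: str) -> Link:
    """Build an undirected link between nodes m and n."""
    return frozenset((m, n))


def classify_failure(tree_a: Set[Link], tree_b: Set[Link], faulty: Link) -> str:
    """Classify a single link failure with respect to the two trees of an SSC."""
    in_a = faulty in tree_a
    in_b = faulty in tree_b
    if in_a and in_b:
        return "shared"      # both spanning trees break down
    if in_a or in_b:
        return "unshared"    # exactly one tree breaks; the other still spans the network
    return "unaffected"      # the link belongs to neither tree


# Hypothetical SSC on nodes a, b, c, d; link a-b is the only shared link.
primary = {link("a", "b"), link("b", "c"), link("c", "d")}
backup = {link("a", "b"), link("a", "d"), link("a", "c")}

kind_ab = classify_failure(primary, backup, link("a", "b"))
kind_bc = classify_failure(primary, backup, link("b", "c"))
kind_bd = classify_failure(primary, backup, link("b", "d"))
```

Only the "shared" case interrupts transmission on both trees, which is why the two cases are handled separately in the rest of the paper.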

In this paper, we aim to quickly restore the faulty SSC, for both the shared link failure and the unshared link failure, by adjusting and updating its faulty spanning trees. Meanwhile, we aim to obtain a restored SSC with nearly optimal network survivability and the desired bandwidth capacity. To this end, we have to tackle the following challenges:

  • Challenge I: In order to recover the faulty SSC, we need to select fault-free links to replace the faulty links in the spanning trees of the SSC. However, the fault-free links to be selected differ for faulty SSCs with different types of link failures, and the selection may affect the number of shared links in the restored SSC. In addition, each link has a different failure probability. Therefore, it is a challenge to select the most appropriate fault-free link for a faulty SSC under each type of link failure, so as to obtain a restored SSC with nearly optimal network survivability.

  • Challenge II: Each link in the network has some QoS-related capacities, and different links have different capacities for the same QoS requirement. In this case, unreasonable link selection to recover the faulty SSC may make the restored SSC unable to meet the QoS requirements. Therefore, it is a challenge to select the best feasible link, to obtain a restored SSC with the desired bandwidth capacity.

For challenge I, we split each faulty spanning tree in the faulty SSC into two disjoint subtrees. These subtrees are rooted at the two end nodes of the corresponding faulty link, respectively. Then, we find a candidate link set for each faulty spanning tree. Each link in this set connects the two subtrees of the faulty spanning tree and can therefore recover data transmission in that tree. According to the relationship between all candidate link sets, we select the link with the minimum failure probability to replace the faulty link in the faulty SSC, under the premise of adding as few shared links to the restored SSC as possible. For challenge II, we form an auxiliary network for the given network. Specifically, we examine the links in the network one by one and check whether the bandwidth capacity of each link meets the bandwidth requirement of the network. The auxiliary network is then formed by removing the links whose bandwidth capacities are less than the bandwidth requirement. The main contributions of this paper are summarized as follows.
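
The ideas for both challenges can be sketched for a single faulty spanning tree as follows. This is a simplified illustration under stated assumptions, not the restoration algorithms of this paper: the helper names, the data layout (each link mapped to a pair of failure probability and bandwidth capacity) and the tie-breaking rule (first avoid links already in the other tree, then minimize failure probability) are all hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Optional, Set, Tuple

Link = FrozenSet[str]  # assumption: undirected link as an unordered node pair


def link(m: str, n: str) -> Link:
    return frozenset((m, n))


def auxiliary_network(links: Dict[Link, Tuple[float, float]], B: float) -> Dict[Link, float]:
    """Drop links whose bandwidth capacity is below B; keep failure probabilities."""
    return {e: p for e, (p, b) in links.items() if b >= B}


def subtree_nodes(tree: Set[Link], faulty: Link, root: str) -> Set[str]:
    """Nodes reachable from root in the tree after the faulty link is removed (BFS)."""
    adjacency: Dict[str, Set[str]] = {}
    for e in tree - {faulty}:
        m, n = tuple(e)
        adjacency.setdefault(m, set()).add(n)
        adjacency.setdefault(n, set()).add(m)
    seen, queue = {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def candidate_links(aux: Dict[Link, float], tree: Set[Link], faulty: Link) -> Set[Link]:
    """Fault-free links of the auxiliary network that reconnect the two subtrees."""
    m, _ = tuple(faulty)
    side_m = subtree_nodes(tree, faulty, m)  # all other nodes lie in the second subtree
    result = set()
    for e in aux:
        if e == faulty:
            continue
        u, v = tuple(e)
        if (u in side_m) != (v in side_m):  # exactly one end node in each subtree
            result.add(e)
    return result


def pick_replacement(aux: Dict[Link, float], candidates: Set[Link],
                     other_tree: Set[Link]) -> Optional[Link]:
    """Prefer links that add no shared link, then the lowest failure probability."""
    if not candidates:
        return None
    return min(candidates, key=lambda e: (e in other_tree, aux[e]))


# Hypothetical network: link -> (failure probability, bandwidth capacity).
links = {
    link("a", "b"): (0.01, 10.0), link("b", "c"): (0.02, 10.0),
    link("c", "d"): (0.03, 10.0), link("a", "d"): (0.05, 10.0),
    link("a", "c"): (0.06, 10.0), link("b", "d"): (0.01, 2.0),
}
primary = {link("a", "b"), link("b", "c"), link("c", "d")}
backup = {link("a", "b"), link("a", "d"), link("c", "d")}

aux = auxiliary_network(links, B=5.0)
cands = candidate_links(aux, primary, faulty=link("b", "c"))
choice = pick_replacement(aux, cands, other_tree=backup)
```

In this example, link b-d is excluded by the bandwidth filter although it has the lowest failure probability, and link a-c is chosen over a-d because a-d already belongs to the other tree and would become a new shared link.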

  • We focus on restoring the faulty SSC, for both the shared link failure and the unshared link failure, by updating and adjusting its faulty spanning trees. Meanwhile, we formulate an SSC restoration problem to maximize the network survivability of the restored SSC under the bandwidth requirement of the network.

  • We design an efficient main algorithm to solve the SSC restoration problem. The main algorithm consists of two fast restoration algorithms. The first is the shared link max-survivability (SLMS) algorithm, which deals with the shared link failure. It forms a new SSC by updating the two faulty spanning trees in the invalid SSC. The second is the unshared link max-survivability (USLMS) algorithm, which copes with the unshared link failure. It generates a new SSC by adjusting and updating the faulty spanning tree in the invalid SSC. In addition, the restored SSCs formed by the two proposed restoration algorithms have nearly optimal network survivability.

  • We conduct simulations to evaluate the performance of the two proposed restoration algorithms. Simulation results show that the proposed algorithms reduce the recovery time by 36.69% and 55.54%, respectively, compared with the existing algorithm. The proposed algorithms also achieve nearly optimal network survivability.

Compared with our earlier work [31], we claim the following new contributions. First, in addition to restoring the SSC from the shared link failure, we further consider how to restore the SSC from the unshared link failure. Second, we present an efficient main algorithm that shows in detail how to solve the formulated SSC restoration problem. Meanwhile, a new restoration algorithm is proposed to deal with the unshared link failure. Third, we conduct more simulations to evaluate our proposed restoration algorithms.

The remainder of this paper is organized as follows. The SSC restoration problem is formulated in Section 2, while the restoration algorithms are presented in Section 3. Section 4 shows the simulation results, and Section 5 concludes this paper with remarks on future work.

Preliminaries

A network can be represented by an undirected graph $G(V,E)$, where $V$ is the set of nodes and $E$ is the set of links. The link $e(m,n) \in E$ connects two nodes $m,n \in V$, and it is associated with a tuple $(p_e, b_e)$, in which $p_e$ and $b_e$ denote its failure probability and bandwidth capacity, respectively. The failure probability $p_e$ of each link $e(m,n)$ is independent, $p_e \in (0,1]$, and it can be estimated from the historical failure data of the network. The network $G(V,E)$ has a bandwidth requirement $B$ on the links,

Algorithms

Experiments

In this section, we conduct simulations to evaluate the performance of the proposed algorithms, in terms of recovery time and network survivability. Algorithm CBMS [30] is selected as a baseline algorithm to make comparisons with the proposed algorithms.

Conclusion

We have formulated an SSCR problem to obtain a restored SSC with nearly optimal network survivability and the desired bandwidth requirement. Two fast restoration algorithms have been proposed to solve the SSCR problem under the shared link failure and the unshared link failure, respectively. These two restoration algorithms can achieve nearly optimal network survivability. Simulation results show that both algorithm SLMS and algorithm USLMS can significantly reduce the recovery

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62072118 and 62174038.

Jiale Huang received the B.E. and M.E. degrees from Guangdong University of Technology in 2018 and 2021, respectively. He is now working toward the Ph.D. degree in Computer Science and Technology at Guangdong University of Technology. His main research interests include fault tolerant computing, network economics and mobile computing.

References (34)

  • Johnston M. et al. A robust optimization approach to backup network design with random failures. IEEE/ACM Trans Netw (2015)

  • Khan M. et al. Multi-path link embedding for survivability in virtual networks. IEEE Trans Netw Serv Manag (2016)

  • Killi B. et al. Link failure aware capacitated controller placement in software defined networks.

  • Yallouz J. et al. Minimum-weight link-disjoint node-“somewhat disjoint” paths. IEEE/ACM Trans Netw (2018)

  • Sprecher N. et al. MPLS transport profile (MPLS-TP) survivability framework. IETF RFC 6372 (2011)

  • G.8032: Ethernet ring protection switching (2020)

  • Bejerano Y. et al. Algorithms for computing QoS paths with restoration. IEEE/ACM Trans Netw (2005)
    Lulu Zheng received the Bachelor of Science degree in Information and Computing Science from Anqing Normal University in 2016. She received the Master's degree from the School of Computer Science and Technology, Guangdong University of Technology in 2019. She currently works for Alibaba. Her main research interests are fault tolerant techniques in reconfigurable networks.

    Yalan Wu received the B.Sc. and Ph.D. degrees from Guangdong University of Technology in 2016 and 2021, respectively. She is now a postdoctoral researcher at the School of Computer Science and Technology, Guangdong University of Technology. Her research interests include fault tolerant computing, vehicular networks, mobile computing and high performance architecture.

    Peng Liu received the B.E. degree from Xiangtan University in 2006, and the M.E. and Ph.D. degrees from Hunan University in 2011 and 2017, respectively. He is currently a lecturer at the School of Computer Science and Technology, Guangdong University of Technology. His research interests include digital circuit testing, memristor-based circuit design, and test.

    Jigang Wu received the B.Sc. degree from Lanzhou University, and the Ph.D. degree from the University of Science and Technology of China. He is now a distinguished professor at the School of Computer Science and Technology, Guangdong University of Technology. His research interests include network computing and machine learning.

    This paper is for special section VSI-pdcat4. Reviews were processed by Guest Editor Dr. Yicheng Xu and recommended for publication.
