GNSGA: A Decentralized Data Replication Algorithm for Big Science Data


Abstract:

Domain science applications in fields such as genomics and high-energy particle physics use geographically distributed data federations to publish and access datasets. Data is typically replicated among data federation nodes to improve efficiency and fault tolerance. While replication strategies are well documented for distributed databases (e.g., Apache Cassandra), replication among distributed data storage nodes can be ad hoc. Replication over wide area networks can also require global coordination (or global shared state), which is not ideal when multiple organizations are involved. In this paper, we introduce GNSGA (Greedy Non-dominated Sorting Genetic Algorithm II), an optimization algorithm that combines a greedy strategy with a non-dominated sorting genetic algorithm to solve multi-objective optimization problems. The “greedy” aspect refers to the greedy strategy used to select nodes, while the Non-dominated Sorting Genetic Algorithm II (NSGA-II) is a fast non-dominated multi-objective optimization algorithm with an elite retention strategy. Replication decisions in GNSGA are based on the local properties and resource availability of the data storage nodes. By combining the greedy and NSGA-II algorithms, GNSGA optimizes multiple conflicting objectives while satisfying replica placement constraints such as cost, time, and storage capacity. We compared GNSGA with popular replica placement strategies, such as closest node replication, shortest transfer time, and a Particle Swarm Optimization (PSO)-based replication algorithm. For evaluation, we performed simulations and an actual deployment on the NSF's FABRIC testbed. The results demonstrate that GNSGA consistently selects nodes that reduce replication time by 5.8%-15.4% while satisfying the replication constraints (i.e., cost, time, and storage). We also show that GNSGA is beneficial for replicating large files over wide area networks.
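The two ideas named in the abstract, non-dominated sorting and greedy node selection, can be illustrated with a minimal Python sketch. This is not the authors' GNSGA implementation: it only computes the first non-dominated (Pareto) front over two objectives and then greedily picks the fastest node from it, omitting the crossover, mutation, crowding-distance, and elitism steps of a full NSGA-II. The `Node` fields, the sample values, and the choice of transfer time as the greedy criterion are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # All fields are hypothetical stand-ins for "local properties" of a storage node.
    name: str
    cost: float           # assumed cost of placing a replica on this node
    transfer_time: float  # assumed estimated transfer time, in seconds
    free_storage: float   # assumed free capacity, in GB


def dominates(a: Node, b: Node) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.cost <= b.cost and a.transfer_time <= b.transfer_time
    better = a.cost < b.cost or a.transfer_time < b.transfer_time
    return no_worse and better


def pareto_front(nodes):
    """Return the nodes that no other node dominates (the first non-dominated front)."""
    return [n for n in nodes if not any(dominates(m, n) for m in nodes)]


def pick_replica_node(nodes, file_size):
    """Filter by the storage constraint, then greedily take the fastest Pareto node."""
    feasible = [n for n in nodes if n.free_storage >= file_size]
    front = pareto_front(feasible)
    if not front:
        return None  # no node can hold the file
    return min(front, key=lambda n: n.transfer_time)


# Made-up candidate nodes, purely for demonstration.
candidates = [
    Node("a", cost=4.0, transfer_time=10.0, free_storage=500.0),
    Node("b", cost=2.0, transfer_time=14.0, free_storage=500.0),
    Node("c", cost=5.0, transfer_time=15.0, free_storage=500.0),
    Node("d", cost=1.0, transfer_time=8.0, free_storage=50.0),
]

chosen = pick_replica_node(candidates, file_size=100.0)
print(chosen.name if chosen else "no feasible node")
```

In this sketch node `d` is excluded by the storage constraint for a 100 GB file, node `c` is dominated by node `a`, and the greedy step chooses between the remaining trade-off candidates `a` and `b`. Each decision uses only per-node properties, which loosely mirrors the abstract's point that placement does not need global shared state.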
Date of Conference: 12-15 June 2023
Date Added to IEEE Xplore: 24 July 2023
Electronic ISSN: 1861-2288
Conference Location: Barcelona, Spain
