A survey of dynamic replication and replica selection strategies based on data mining techniques in data grids

doi:10.1016/j.engappai.2015.11.002

Engineering Applications of Artificial Intelligence

Volume 48, February 2016, Pages 140-158

https://doi.org/10.1016/j.engappai.2015.11.002

Abstract

Mining grid data is a research field that applies data mining techniques to grid systems in order to efficiently discover new, meaningful knowledge that can enhance grid management. In this paper, we focus on how the extracted knowledge can enhance data replication and replica selection strategies, two important data management techniques commonly used in data grids. Relevant knowledge such as file access patterns, file correlations, user or job access behavior, and predictions of future behavior or network performance can be efficiently discovered. These findings are then used to enhance both data replication and replica selection strategies. Various works in this respect are discussed along with their merits and demerits. In addition, we propose a new guideline for applying data mining in the context of data replication and replica selection strategies.

Section snippets

Introduction and motivations

Data grids primarily provide services and infrastructure for distributed data-intensive applications that need to access, transfer and manage massive data sets stored in distributed storage resources. Data-intensive applications are becoming increasingly prevalent in domains of scientific and engineering research such as high-energy physics, Earth science, bioinformatics, data mining, and astronomy. In this kind of dynamic and large scale environment, a lot of challenges revolve

Background of data mining

Data mining can be defined as the automated process of extracting previously unknown and useful knowledge and information, including patterns, associations, changes, trends, anomalies and significant structures, from large or complex data sets (Han et al., 2011; Zaki, 2014).

The following paragraphs give an overview of association analysis, classification and clustering, which are the main data mining tasks relied on by data grid strategies. Note however that several other data mining tasks exist

Utility of replication

Effective data management is a critical issue in data grid systems and involves many challenges. In this regard, replication is one of the most widely used ways to cope with these challenges effectively. It is also used in distributed database systems (Nicola and Jarke, 2000), mobile systems (Padmanabhan et al., 2008), P2P systems (Martins et al., 2006), parallel and distributed systems (Goel and Buyya, 2006), and cloud systems (Malik et al., 2015), to name but a few.

The main idea of replication in

Replication strategies based on data mining techniques

In this section, replication strategies based on data mining techniques are presented. The strategies are grouped according to the data mining technique they use. The first five mainly rely on pattern mining, the next two mainly use Bayesian networks, the eighth and ninth apply clustering techniques, and the last strategy is based on a classification technique.
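As a rough illustration of the pattern mining idea behind the first group of strategies, the sketch below counts which file pairs are accessed together and computes the support and confidence measures used in association analysis. It is not the algorithm of any surveyed strategy: the function names, the toy sessions and the threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_file_pairs(sessions, min_support):
    """Return {pair: support} for file pairs co-accessed in at least
    a min_support fraction of the sessions (hypothetical helper)."""
    pair_counts = Counter()
    for files in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


def rule_confidence(sessions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, i.e. the fraction
    of sessions containing the antecedent that also contain the consequent."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


# Toy access history (assumed data): one set of files per job or session.
sessions = [
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f2", "f4"},
    {"f1", "f2", "f4"},
]
pairs = frequent_file_pairs(sessions, min_support=0.5)
```

A replication strategy could, for instance, replicate `f2` alongside `f1` when the confidence of `f1 -> f2` is high, on the assumption that a job requesting one will soon request the other.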

Replica selection strategies based on data mining techniques

In data grids, large data sets, on the order of terabytes or even petabytes, are replicated over dispersed sites. In this context, data transfers are very costly and consume large amounts of bandwidth (Ranganathan and Foster, 2001a). This has led to the question of which replica can be accessed most efficiently (Vazhkudai and Schopf, 2003). Indeed, when different sites hold a replica of a particular file, there is a significant interest in selecting the most appropriate replica site. A
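To make the selection problem concrete, here is a minimal sketch that picks the replica site with the lowest estimated transfer time. It assumes a deliberately simple predictor (the mean of recently observed bandwidths per site) in place of the data mining models the surveyed strategies use; the function names, site names and figures are hypothetical.

```python
from statistics import mean


def estimate_transfer_time(file_size_mb, bandwidth_history, latency_s=0.0):
    """Estimate transfer time in seconds from past bandwidth samples (MB/s).
    The mean is an assumed stand-in for a learned bandwidth predictor."""
    predicted_bandwidth = mean(bandwidth_history)
    return latency_s + file_size_mb / predicted_bandwidth


def select_replica(file_size_mb, sites):
    """Return the name of the site with the smallest estimated transfer time.
    `sites` maps a site name to its non-empty bandwidth history in MB/s."""
    return min(
        sites,
        key=lambda name: estimate_transfer_time(file_size_mb, sites[name]),
    )


# Assumed bandwidth observations for three sites holding the same file.
sites = {
    "site_a": [10, 12, 8],
    "site_b": [20, 5, 5],
    "site_c": [15, 15, 15],
}
best = select_replica(1500, sites)
```

With these figures, `site_c` has the highest mean bandwidth and is selected. A history-based predictor of this kind ignores current load, which is one reason richer models are of interest.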

Proposed guideline

At a glance, a strategy based on a data mining technique should be composed of three key steps:

• First step: The grid data selection and preprocessing. Which data to consider in the grid data mining process is an important issue for which the right solution must be found, since it is a key factor in the success of the whole process. Indeed, before starting a data mining process, in order to extract useful knowledge, such as network performance prediction, file
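The selection and preprocessing step can be pictured with a small sketch that turns raw access records into one transaction per job, ready for a mining algorithm. The record layout `(job_id, file_name)` and the rule for dropping incomplete records are assumptions for illustration; the source does not specify the format of the grid data.

```python
from collections import defaultdict


def build_transactions(access_log):
    """Group (job_id, file_name) records into one set of files per job.
    Records with an empty file name are treated as incomplete and skipped."""
    files_by_job = defaultdict(set)
    for job_id, file_name in access_log:
        if not file_name:
            continue
        files_by_job[job_id].add(file_name)
    return dict(files_by_job)


# Assumed raw log: repeated accesses and one incomplete record.
access_log = [
    ("job1", "f1"),
    ("job1", "f2"),
    ("job2", "f2"),
    ("job1", "f1"),
    ("job2", ""),
    ("job3", "f4"),
]
transactions = build_transactions(access_log)
```

Using sets removes repeated accesses within a job, which suits co-access mining but discards access frequency; a strategy that needs popularity counts would keep them.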

Conclusion

In this paper, we have presented a survey of data mining-based replication and replica selection strategies dedicated to data grids. The main objective of this work is to study how data mining techniques can be applied to historical grid data, how they discover new and interesting knowledge, and how that knowledge is used to enhance both data replication and replica selection strategies. Three contributions are made in this work: (i) A survey of the main replication strategies based on data mining

Acknowledgments

We would like to express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions.

References (92)

B. Allcock et al.
Data management and transfer in high-performance computational grid environments
Parallel Comput.
(2002)
R.M. Almuttairi et al.
A two phased service oriented broker for replica selection in data grids
Future Gener. Comput. Syst.
(2013)
T. Amjad et al.
A survey of dynamic replication strategies for improving data availability in data grids
Future Gener. Comput. Syst.
(2012)
A. Doğan
A study on performance of dynamic file replication algorithms for real-time file access in data grids
Future Gener. Comput. Syst.
(2009)
P. Giudici et al.
Data mining of association structures to model consumer behaviour
Comput. Stat. Data Anal.
(2002)
T. Hamrouni et al.
Impact of the distribution quality of file replicas on replication strategies
J. Netw. Comput. Appl.
(2015)
T. Hamrouni et al.
A data mining correlated patterns-based periodic decentralized replication strategy for data grids
J. Syst. Softw.
(2015)
L.M. Khanli et al.
PHFS: a dynamic replication method, to decrease access latency in the multi-tier data grid
Future Gener. Comput. Syst.
(2011)
L.M. Khanli et al.
Active rule learning using decision tree for resource management in grid computing
Future Gener. Comput. Syst.
(2011)
M. Lee et al.
PFRF: an adaptive data replication algorithm based on star-topology data grids
Future Gener. Comput. Syst.
(2012)

J. Ma et al.

A classification of file placement and replication methods on grids

Future Gener. Comput. Syst.

(2013)

R.M. Rahman et al.

Replica selection strategies in data grid

J. Parallel Distrib. Comput.

(2008)

N. Saadat et al.

PDDRA: a new pre-fetching based dynamic data replication algorithm in data grids

Future Gener. Comput. Syst.

(2012)

M. Tang et al.

Dynamic replication algorithms for the multi-tier data grid

Future Gener. Comput. Syst.

(2005)

D. Yuan et al.

A data placement strategy in scientific cloud workflows

Future Gener. Comput. Syst.

(2010)

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International...

Almuttairi, R.M., 2012. Replica selection technique for binding cheapest replica sites in data grids. In: Proceedings...

Almuttairi, R.M., Wankar, R., Negi, A., Chillarige, R.R., 2010a. Rough set clustering approach to replica selection in...

Almuttairi, R.M., Wankar, R., Negi, A., Rao, C.R., 2010b. Intelligent replica selection strategy for data grid. In:...

Almuttairi, R.M., Wankar, R., Negi, A., Rao, C.R., 2010c. Replica selection in data grids using preconditioning of...

Almuttairi, R.M., Wankar, R., Negi, A., Rao, C.R., 2010d. Smart replica selection for data grids using rough set...

Baheri, F.V., Davardoost, F., Ahmadzadeh, V., 2012. Data mining with learning decision tree and Bayesian network for...

Bautista Villalpando, L.E., April, A., Abran, A., 2014. Performance analysis model for big data applications in cloud...

W.H. Bell et al.

Simulation of dynamic grid replication strategies in OptorSim

J. High Perform. Comput. Appl.

(2002)

W.H. Bell et al.

OptorSim: a grid simulator for studying dynamic data replication strategies

Int. J. High Perform. Comput. Appl.

(2003)

Bell, W.H., Cameron, D.G., Carvajal-Schiaffino, R., Millar, A.P., Stockinger, K., Zini, F., 2003. Evaluation of an...

F. Ben Charrada et al.

An efficient replica placement strategy in highly dynamic data grids

Int. J. Grid Util. Comput.

(2011)

D. Boru et al.

Energy-efficient data replication in cloud computing datacenters

Clust. Comput.

(2015)

Bouasker, S., Hamrouni, T., Ben Yahia, S., 2012. New exact concise representation of rare correlated patterns:...

Bouyer, A., Karimi, M., Jalali, M., 2009. An online and predictive method for grid scheduling based on data mining and...

R. Buyya et al.

GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing

Concurr. Comput.: Pract. Exp.

(2002)

Charrada, F.B., Ounelli, H., Chettaoui, H., 2010. Dynamic period vs static period in data grid replication. In:...

Chettaoui, H., Ben Charrada, F., 2012. A decentralized periodic replication strategy based on knapsack problem. In:...

H. Chettaoui et al.

A new decentralized periodic replication strategy for dynamic data grids

Scalable Comput.: Pract. Exp.

(2014)

Z. Cui et al.

Based on support and confidence dynamic replication algorithm in multi-tier data grid

J. Comput. Inf. Syst.

(2013)

A. Doğan et al.

DGridSim: a multi-model discrete-event simulator for real-time data grid systems

Simulation

(2014)

Doraimani, S., 2007. Filecules: a new granularity for resource management in grids (Master thesis). University of South...

Duan, R., Prodan, R., Fahringer, T., 2006. Data mining-based fault prediction and detection on the grid. In:...

Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J., 2011. CoHadoop: flexible data placement...

Foster, I., 2007. Grid and data mining: more related than you might think. In: National Science Foundation Symposium on...

I. Foster et al.

The anatomy of the grid: enabling scalable virtual organizations

Int. J. High Perform. Comput. Appl.

(2001)

Fu, X., Ren, R., Zhan, J., Zhou, W., Jia, Z., Lu, G., 2012. LogMaster: mining event correlations in logs of large-scale...

W. Gang et al.

A decentralized approach for mining event correlations in distributed system monitoring

J. Parallel Distrib. Comput.

(2013)

Goel, S., Buyya, R., 2006. Data replication strategies in wide area distributed systems. In: Enterprise Service...

R.K. Grace et al.

Data access prediction and optimization in data grid using SVM and AHL classifications

Int. Rev. Comput. Softw.

(2014)

R.K. Grace et al.

Dynamic replica placement and selection strategies in data grids—a comprehensive survey

J. Parallel Distrib. Comput.

(2014)


^☆: This paper is a largely extended version of the work presented in Hamrouni et al. (2015c).
