Survey Paper
A survey of dynamic replication and replica selection strategies based on data mining techniques in data grids

https://doi.org/10.1016/j.engappai.2015.11.002

Abstract

Mining grid data is a research field that aims at analyzing grid systems with data mining techniques in order to efficiently discover new, meaningful knowledge that enhances grid management. In this paper, we focus on how the extracted knowledge can enhance data replication and replica selection strategies, which are important data management techniques commonly used in data grids. Relevant knowledge such as file access patterns, file correlations, user or job access behavior, and predictions of future behavior or network performance can be efficiently discovered. These findings are then used to enhance both data replication and replica selection strategies. Various works in this respect are discussed along with their merits and demerits. In addition, we propose a new guideline for applying data mining in the context of data replication and replica selection strategies.

Section snippets

Introduction and motivations

Data grids primarily provide services and infrastructure for distributed data-intensive applications that need to access, transfer and manage massive data sets stored in distributed storage resources. Data-intensive applications are becoming increasingly prevalent in domains of scientific and engineering research such as high energy physics, Earth science, bioinformatics, data mining, and astronomy. In this kind of dynamic and large scale environment, a lot of challenges revolve

Background of data mining

Data mining can be defined as the automated process of extracting previously unknown and useful knowledge and information, including patterns, associations, changes, trends, anomalies and significant structures, from large or complex data sets (Han et al., 2011; Zaki, 2014).

The following paragraphs give an overview of association analysis, classification and clustering, which are the main data mining tasks relied on by data grid strategies. Note however that several other data mining tasks exist

Utility of replication

Effective data management is a critical issue in data grid systems and involves many challenges. In this regard, replication is one of the most widely used ways to cope effectively with these challenges. It is also used in distributed database systems (Nicola and Jarke, 2000), mobile systems (Padmanabhan et al., 2008), P2P systems (Martins et al., 2006), parallel and distributed systems (Goel and Buyya, 2006), and cloud systems (Malik et al., 2015), to name but a few.

The main idea of replication in

Replication strategies based on data mining techniques

In this section, replication strategies based on data mining techniques are presented. The strategies are grouped according to the data mining technique they use. In this regard, the first five mainly rely on pattern mining, the next two mainly use Bayesian networks, the eighth and the ninth apply clustering techniques, while the last strategy is based on a classification technique.
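
To make the pattern mining idea concrete, the following is a minimal sketch, not the algorithm of any surveyed strategy. It assumes that the access history is available as one list of file names per job, and it derives pairwise rules of the form "jobs that read file a also read file b" using the standard support and confidence measures. The function name `correlated_file_pairs`, the thresholds and the sample history are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def correlated_file_pairs(job_accesses, min_support=0.5, min_confidence=0.8):
    """Return rules (a, b, support, confidence): jobs that read a also read b.

    job_accesses is an assumed input format: one iterable of file names
    per job. Thresholds are illustrative defaults, not recommended values.
    """
    n_jobs = len(job_accesses)
    file_counts = Counter()   # number of jobs that accessed each file
    pair_counts = Counter()   # number of jobs that accessed both files of a pair

    for files in job_accesses:
        unique = sorted(set(files))  # count each file once per job
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n_jobs
        if support < min_support:
            continue  # the pair is not frequent enough
        # A pair yields two candidate rules, one per direction.
        for src, dst in ((a, b), (b, a)):
            confidence = count / file_counts[src]
            if confidence >= min_confidence:
                rules.append((src, dst, support, confidence))
    return sorted(rules)


# Hypothetical access history: four jobs and the files each one read.
history = [
    ["f1", "f2", "f3"],
    ["f1", "f2"],
    ["f2", "f3"],
    ["f1", "f2", "f4"],
]
rules = correlated_file_pairs(history)
for src, dst, support, confidence in rules:
    print(f"{src} -> {dst}: support={support:.2f}, confidence={confidence:.2f}")
```

A replication strategy could use such rules to replicate or prefetch the correlated file together with the requested one; how the rules are actually exploited differs from one strategy to another.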

Replica selection strategies based on data mining techniques

In data grids, large data sets, on the order of terabytes or even petabytes, are replicated over dispersed sites. In this context, data transfers are very costly and consume large amounts of bandwidth (Ranganathan and Foster, 2001a). This has led to the question of which replica can be accessed most efficiently (Vazhkudai and Schopf, 2003). Indeed, when different sites hold a replica of a particular file, there is a significant interest in selecting the most appropriate replica site. A

Proposed guideline

At a glance, a strategy based on a data mining technique should be composed of three key steps:

  • First step: Grid data selection and preprocessing. Which data to consider in the grid data mining process is an important issue that must be resolved correctly, since it is a key factor for the success of the whole process. Indeed, before starting a data mining process, in order to extract useful knowledge, such as network performance prediction, file

Conclusion

We have presented in this paper a survey of data mining-based replication and replica selection strategies dedicated to data grids. The main objective of this work is to study how data mining techniques can be applied to historical grid data, how they discover new interesting knowledge, and how that knowledge is used to enhance both data replication and replica selection strategies. Three contributions are made in this work: (i) A survey of the main replication strategies based on data mining

Acknowledgments

We would like to express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions.

References (92)

  • J. Ma et al.

    A classification of file placement and replication methods on grids

    Future Gener. Comput. Syst.

    (2013)
  • R.M. Rahman et al.

    Replica selection strategies in data grid

    J. Parallel Distrib. Comput.

    (2008)
  • N. Saadat et al.

    PDDRA: a new pre-fetching based dynamic data replication algorithm in data grids

    Future Gener. Comput. Syst.

    (2012)
  • M. Tang et al.

    Dynamic replication algorithms for the multi-tier data grid

    Future Gener. Comput. Syst.

    (2005)
  • D. Yuan et al.

    A data placement strategy in scientific cloud workflows

    Future Gener. Comput. Syst.

    (2010)
  • Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International...
  • Almuttairi, R.M., 2012. Replica selection technique for binding cheapest replica sites in data grids. In: Proceedings...
  • Almuttairi, R.M., Wankar, R., Negi, A., Chillarige, R.R., 2010a. Rough set clustering approach to replica selection in...
  • Almuttairi, R.M., Wankar, R., Negi, A., Rao, C.R., 2010b. Intelligent replica selection strategy for data grid. In:...
  • Almuttairi, R.M., Wankar, R., Negi, A., Rao, C.R., 2010c. Replica selection in data grids using preconditioning of...
  • Almuttairi, R.M., Wankar, R., Negi, A., Rao, C.R., 2010d. Smart replica selection for data grids using rough set...
  • Baheri, F.V., Davardoost, F., Ahmadzadeh, V., 2012. Data mining with learning decision tree and Bayesian network for...
  • Bautista Villalpando, L.E., April, A., Abran, A., 2014. Performance analysis model for big data applications in cloud...
  • W.H. Bell et al.

    Simulation of dynamic grid replication strategies in OptorSim

    J. High Perform. Comput. Appl.

    (2002)
  • W.H. Bell et al.

    OptorSim: a grid simulator for studying dynamic data replication strategies

    Int. J. High Perform. Comput. Appl.

    (2003)
  • Bell, W.H., Cameron, D.G., Carvajal-Schiaffino, R., Millar, A.P., Stockinger, K., Zini, F., 2003. Evaluation of an...
  • F. Ben Charrada et al.

    An efficient replica placement strategy in highly dynamic data grids

    Int. J. Grid Util. Comput.

    (2011)
  • D. Boru et al.

    Energy-efficient data replication in cloud computing datacenters

    Clust. Comput.

    (2015)
  • Bouasker, S., Hamrouni, T., Ben Yahia, S., 2012. New exact concise representation of rare correlated patterns:...
  • Bouyer, A., Karimi, M., Jalali, M., 2009. An online and predictive method for grid scheduling based on data mining and...
  • R. Buyya et al.

    GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing

    Concurr. Comput.: Pract. Exp.

    (2002)
  • Charrada, F.B., Ounelli, H., Chettaoui, H., 2010. Dynamic period vs static period in data grid replication. In:...
  • Chettaoui, H., Ben Charrada, F., 2012. A decentralized periodic replication strategy based on knapsack problem. In:...
  • H. Chettaoui et al.

    A new decentralized periodic replication strategy for dynamic data grids

    Scalable Comput.: Pract. Exp.

    (2014)
  • Z. Cui et al.

    Based on support and confidence dynamic replication algorithm in multi-tier data grid

    J. Comput. Inf. Syst.

    (2013)
  • A. Doğan et al.

    DGridSim: a multi-model discrete-event simulator for real-time data grid systems

    Simulation

    (2014)
  • Doraimani, S., 2007. Filecules: a new granularity for resource management in grids (Master thesis). University of South...
  • Duan, R., Prodan, R., Fahringer, T., 2006. Data mining-based fault prediction and detection on the grid. In:...
  • Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J., 2011. CoHadoop: flexible data placement...
  • Foster, I., 2007. Grid and data mining: more related than you might think. In: National Science Foundation Symposium on...
  • I. Foster et al.

    The anatomy of the grid: enabling scalable virtual organizations

    Int. J. High Perform. Comput. Appl.

    (2001)
  • Fu, X., Ren, R., Zhan, J., Zhou, W., Jia, Z., Lu, G., 2012. LogMaster: mining event correlations in logs of large-scale...
  • W. Gang et al.

    A decentralized approach for mining event correlations in distributed system monitoring

    J. Parallel Distrib. Comput.

    (2013)
  • Goel, S., Buyya, R., 2006. Data replication strategies in wide area distributed systems. In: Enterprise Service...
  • R.K. Grace et al.

    Data access prediction and optimization in data grid using SVM and AHL classifications

    Int. Rev. Comput. Softw.

    (2014)
  • R.K. Grace et al.

    Dynamic replica placement and selection strategies in data grids—a comprehensive survey

    J. Parallel Distrib. Comput.

    (2014)

    This paper is a largely extended version of the work presented in Hamrouni et al. (2015c).
