Abstract
Cloud computing systems provide high-performance computing resources and distributed storage for data-intensive computations. Data scheduling between data centers is becoming indispensable in cloud computing systems where many large datasets are stored at different data centers and data analytics requires inter-center data access. However, the performance of data scheduling depends heavily on how sensibly the data are placed. Data placement is a key optimization method for reducing data scheduling between data centers and achieving statistical I/O load balancing, which in turn reduces the mean computation execution time. This paper proposes a data placement strategy, DCCP, based on dynamic computation correlation. DCCP places datasets with high dynamic computation correlation at the same data center while taking into account the I/O load and the capacity load of the data centers; when computations are scheduled to that data center, most of the datasets they process are stored locally, and thus the mean computation execution time can be reduced. Extensive experiments show that DCCP achieves statistical I/O load balancing and capacity load balancing across data centers, and thereby reduces the total data scheduling between data centers as far as possible at very low time complexity, even as the numbers of datasets and data centers increase.
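The idea of co-locating correlated datasets under a capacity limit can be illustrated with a small sketch. This is not the DCCP algorithm itself, whose definition of dynamic computation correlation and whose handling of I/O load are not given in the abstract. It assumes a much simpler stand-in: correlation is the number of computations that read both datasets, placement is greedy, and only storage capacity is constrained. All function names and the input layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_access_correlation(computations):
    """Count, for each dataset pair, how many computations read both.

    Assumption: this co-access count stands in for the paper's
    dynamic computation correlation, which the abstract does not define.
    """
    corr = defaultdict(int)
    for datasets in computations:
        for a, b in combinations(sorted(set(datasets)), 2):
            corr[(a, b)] += 1
    return corr


def place(sizes, computations, capacities):
    """Greedy sketch: put each dataset where its correlated peers already are."""
    corr = co_access_correlation(computations)
    placement = {}
    free = dict(capacities)
    # Place larger datasets first so they are less likely to be stranded.
    for ds in sorted(sizes, key=lambda d: (-sizes[d], d)):
        best, best_key = None, None
        for dc in sorted(free):
            if free[dc] < sizes[ds]:
                continue  # capacity constraint: this center cannot hold ds
            affinity = sum(
                corr.get((min(ds, other), max(ds, other)), 0)
                for other, where in placement.items()
                if where == dc
            )
            # Prefer higher affinity, then the center with more free space.
            key = (affinity, free[dc])
            if best_key is None or key > best_key:
                best, best_key = dc, key
        if best is None:
            raise ValueError(f"no data center can hold {ds}")
        placement[ds] = best
        free[best] -= sizes[ds]
    return placement


def cross_center_accesses(placement, computations):
    """Count dataset reads that need an inter-center transfer, assuming
    each computation runs at the center holding most of its datasets."""
    total = 0
    for datasets in computations:
        counts = defaultdict(int)
        for ds in set(datasets):
            counts[placement[ds]] += 1
        total += len(set(datasets)) - max(counts.values())
    return total
```

With four equally sized datasets `a`, `b`, `c`, `d`, two centers that each hold two of them, and a workload in which `a` and `b` are usually read together and `c` and `d` are usually read together, `place` keeps each pair at one center, so only the occasional computation that spans both pairs needs remote data. A fuller model would also weight each center by its expected I/O load, as the abstract describes.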
Acknowledgments
The research reported in this paper is supported by the National Basic Research Program of China (No. 2011CB302306) and the National Natural Science Foundation of China (Nos. 41271398 and 61402421).
Cite this article
Wang, T., Yao, S., Xu, Z. et al. DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems. J Supercomput 72, 2537–2564 (2016). https://doi.org/10.1007/s11227-015-1511-z
DOI: https://doi.org/10.1007/s11227-015-1511-z