
DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems

The Journal of Supercomputing

Abstract

Cloud computing systems provide high-performance computing resources and distributed storage for data-intensive computations. Data scheduling between data centers is becoming indispensable in cloud systems where many large datasets are stored at different data centers and data analytics requires inter-center data access. The performance of data scheduling, however, depends heavily on how sensibly the data are placed. Data placement is a key optimization method for reducing data scheduling between data centers and achieving statistical I/O load balancing, which in turn reduces the mean computation execution time. This paper proposes DCCP, a data placement strategy based on dynamic computation correlation. DCCP places datasets with high dynamic computation correlation at the same data center while taking the I/O load and capacity load of the data centers into account; when computations are scheduled to that data center, most of the datasets they process are stored locally, so the mean computation execution time is reduced. Extensive experiments show that DCCP achieves statistical I/O load balancing and capacity load balancing across data centers, and reduces the total data scheduling between data centers at very low time complexity, even as the numbers of datasets and data centers increase.
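The placement idea described in the abstract can be illustrated with a small sketch. This is not the DCCP algorithm itself, whose definition of dynamic computation correlation is not given in this preview. As a stand-in, the sketch assumes that the correlation between two datasets is simply the number of computations that read both, and it considers only storage capacity, leaving I/O load balancing out. The function names `co_access_counts`, `place_datasets` and `cross_center_computations`, and the sample data, are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(computations):
    """Count how many computations read each pair of datasets together."""
    counts = Counter()
    for datasets in computations:
        for pair in combinations(sorted(set(datasets)), 2):
            counts[pair] += 1
    return counts


def place_datasets(sizes, computations, capacities):
    """Greedy heuristic (an assumption, not the paper's method):
    co-locate the most frequently co-accessed dataset pairs first.

    sizes: dataset name -> size; capacities: data center name -> capacity.
    Every dataset used by a computation must appear in `sizes`.
    Returns a dict mapping dataset name -> data center name.
    """
    free = dict(capacities)
    placement = {}

    def roomiest(needed):
        # Center with the most free space that can still hold `needed`.
        fitting = [c for c in sorted(free) if free[c] >= needed]
        if not fitting:
            raise ValueError("no data center has enough free capacity")
        return max(fitting, key=lambda c: free[c])

    def assign(dataset, center):
        placement[dataset] = center
        free[center] -= sizes[dataset]

    # Strongest pairs first; ties broken by name for determinism.
    pairs = sorted(co_access_counts(computations).items(),
                   key=lambda item: (-item[1], item[0]))
    for (a, b), _ in pairs:
        if a not in placement and b not in placement:
            both = sizes[a] + sizes[b]
            if any(space >= both for space in free.values()):
                center = roomiest(both)
                assign(a, center)
                assign(b, center)
        elif (a in placement) != (b in placement):
            placed, new = (a, b) if a in placement else (b, a)
            center = placement[placed]
            if free[center] >= sizes[new]:
                assign(new, center)

    # Datasets that could not join a partner go wherever space remains.
    for dataset in sorted(sizes):
        if dataset not in placement:
            assign(dataset, roomiest(sizes[dataset]))
    return placement


def cross_center_computations(placement, computations):
    """Number of computations whose inputs span more than one center."""
    return sum(1 for ds in computations
               if len({placement[d] for d in ds}) > 1)


sizes = {"a": 4, "b": 4, "c": 4, "d": 4}
computations = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["a", "c"]]
capacities = {"dc1": 8, "dc2": 8}

placement = place_datasets(sizes, computations, capacities)
print(placement)
print(cross_center_computations(placement, computations))
```

With this sample input the two strongly co-accessed pairs end up on separate data centers, and only the computation that reads both `a` and `c` needs an inter-center transfer. A fuller version would also weigh each center's I/O load when choosing among candidates, as the abstract describes.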



Acknowledgments

The research reported in this paper was supported by the National Basic Research Program of China (No. 2011CB302306) and the National Natural Science Foundation of China (Nos. 41271398 and 61402421).

Author information


Corresponding author

Correspondence to Zhengquan Xu.


About this article


Cite this article

Wang, T., Yao, S., Xu, Z. et al. DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems. J Supercomput 72, 2537–2564 (2016). https://doi.org/10.1007/s11227-015-1511-z


  • DOI: https://doi.org/10.1007/s11227-015-1511-z
