DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems

Wang, Tao; Yao, Shihong; Xu, Zhengquan; Jia, Shan

doi:10.1007/s11227-015-1511-z

Published: 30 August 2015

The Journal of Supercomputing, volume 72, pages 2537–2564 (2016)

Abstract

Cloud computing systems provide high-performance computing resources and distributed storage for data-intensive computations. Data scheduling between data centers becomes indispensable when many large datasets are stored at different data centers and data analytics requires inter-center data access. However, the performance of data scheduling depends heavily on how well the data are placed. Data placement is a key optimization method for reducing data scheduling between data centers and achieving statistical I/O load balancing, thereby reducing the mean computation execution time. This paper proposes a data placement strategy, DCCP, based on dynamic computation correlation. DCCP places datasets with high dynamic computation correlation at the same data center, taking into account the I/O load and the capacity load of the data centers; when computations are scheduled to that data center, most of the datasets they process are stored locally, so the mean computation execution time is reduced. Extensive experiments show that DCCP achieves statistical I/O load balancing and capacity load balancing across data centers, thereby reducing the total data scheduling between data centers as far as possible at very low time complexity, even as the numbers of datasets and data centers increase.
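The placement idea in the abstract can be illustrated with a small sketch. This is not the DCCP algorithm itself, whose details are not given here. It assumes that the correlation between two datasets is simply the number of computations that read both, and it uses a plain greedy rule with one shared capacity limit and one shared I/O limit per data center. The names `correlation`, `place`, `sizes`, `io_load`, `capacity` and `io_limit` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def correlation(computations):
    """Count, for every dataset pair, how many computations read both.

    Assumption: co-access count stands in for the paper's
    dynamic computation correlation.
    """
    corr = defaultdict(int)
    for datasets in computations:
        for a, b in combinations(sorted(set(datasets)), 2):
            corr[(a, b)] += 1
    return corr


def place(sizes, io_load, computations, centers, capacity, io_limit):
    """Greedy sketch: co-locate correlated datasets within load limits."""
    corr = correlation(computations)
    used = {c: 0 for c in centers}      # storage consumed per center
    load = {c: 0 for c in centers}      # I/O load assigned per center
    members = {c: [] for c in centers}  # datasets placed per center
    placement = {}
    # Place the most heavily accessed datasets first (ties broken by name).
    for d in sorted(sizes, key=lambda d: (-io_load[d], d)):
        best, best_key = None, None
        for c in centers:
            # Skip centers that would exceed the capacity or I/O limit.
            if used[c] + sizes[d] > capacity or load[c] + io_load[d] > io_limit:
                continue
            affinity = sum(
                corr.get((min(d, m), max(d, m)), 0) for m in members[c]
            )
            # Prefer higher correlation, then the less loaded center.
            key = (affinity, -load[c])
            if best_key is None or key > best_key:
                best, best_key = c, key
        if best is None:
            raise ValueError(f"no data center can hold dataset {d!r}")
        placement[d] = best
        used[best] += sizes[d]
        load[best] += io_load[d]
        members[best].append(d)
    return placement
```

With two data centers that each hold two unit-size datasets, and computations that read `a` with `b` and `c` with `d`, this sketch puts each correlated pair in one center, so a computation scheduled to that center finds its inputs locally.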

Acknowledgments

The research reported in this paper was supported by the National Basic Research Program of China (No. 2011CB302306) and the National Natural Science Foundation of China (Nos. 41271398 and 61402421).

Author information

Authors and Affiliations

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Luoyu Road 129, Wuhan, 430079, People’s Republic of China
Tao Wang, Shihong Yao, Zhengquan Xu & Shan Jia
Collaborative Innovation Center for Geospatial Technology, Luoyu Road 129, Wuhan, 430079, People’s Republic of China
Tao Wang, Shihong Yao, Zhengquan Xu & Shan Jia

Corresponding author

Correspondence to Zhengquan Xu.

About this article

Cite this article

Wang, T., Yao, S., Xu, Z. et al. DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems. J Supercomput 72, 2537–2564 (2016). https://doi.org/10.1007/s11227-015-1511-z

Issue Date: July 2016
