A data transmission algorithm for distributed computing system based on maximum flow

Zhang, Xiaolu; Jiang, Jiafu; Zhang, Xiaotong; Wang, Xuan

doi:10.1007/s10586-015-0467-3

Published: 15 July 2015

Cluster Computing, volume 18, pages 1157–1169 (2015)

Abstract

Data skew can cause load imbalance and longer computation times in a distributed computing system. Avoiding data skew and reducing computation time requires transmitting data to appropriate machines, but doing so may consume too many network resources, so computational resources and network resources have to be balanced against each other. In this paper, we introduce a computation model called the distributed two-phase model, in which the processing of a task is divided into two independent phases: data transmission and data computation. Given an upper bound on the relative computation time of the data computation phase, we present a novel algorithm that schedules data transmission to meet this bound while minimizing data transmission time and network bandwidth usage. The algorithm also reduces the number of nodes that participate in the data computation phase, which saves computational resources. Simulation results show that the occupied bandwidth can be reduced effectively (by about 70%) for large-scale data sets and large numbers of nodes. The algorithm is also shown to be applicable in replication scenarios.
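The abstract does not give the algorithm itself, so the following is only an illustrative sketch of how a maximum-flow computation can answer a question of this kind. It is not the authors' method. It assumes a simplified formulation invented for this example: each node holds some units of data, `load_cap` is a hypothetical per-node limit standing in for the computation-time bound, and `link_capacity` is a hypothetical amount each directed link can carry (one hop only) within the transmission window. The function names `max_flow` and `feasible` are likewise assumptions. The redistribution is feasible exactly when the maximum flow equals the total amount of data.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths found by BFS."""
    # residual[u][v] is the remaining capacity on edge u -> v.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            # Make sure every forward edge has a reverse residual edge.
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # BFS from the source, recording each vertex's predecessor.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path is left

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def feasible(data, load_cap, link_capacity):
    """Assumed model: can data be moved so that no node computes more than load_cap?

    data: node -> units initially held
    load_cap: hypothetical maximum units any node may compute
    link_capacity: (src, dst) -> units transferable in the window (single hop)
    """
    capacity = {"s": {}}
    for node, amount in data.items():
        capacity["s"][("out", node)] = amount
        # Data kept locally uses no network capacity.
        capacity[("out", node)] = {("in", node): amount}
        capacity[("in", node)] = {"t": load_cap}
    for (src, dst), units in link_capacity.items():
        capacity[("out", src)][("in", dst)] = units
    return max_flow(capacity, "s", "t") == sum(data.values())


data = {"a": 9, "b": 2, "c": 1}
print(feasible(data, 4, {("a", "b"): 3, ("a", "c"): 3}))
print(feasible(data, 4, {("a", "b"): 1, ("a", "c"): 3}))
```

In this toy instance node `a` is skewed. With a limit of 4 units per node it must offload 5 units, which is possible in the first call and impossible in the second because the link to `b` is too narrow. Minimizing bandwidth or transmission time, as the abstract describes, would need more than this feasibility check, for example a minimum-cost flow or a search over the window length.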

Acknowledgments

This work was supported by the National 863 Project (2011AA040101) and was jointly funded by the Beijing Municipal Education Commission of the Scientific Research.

Author information

Authors and Affiliations

University of Science and Technology Beijing, Beijing, China
Xiaolu Zhang, Jiafu Jiang, Xiaotong Zhang & Xuan Wang

Corresponding author

Correspondence to Xiaotong Zhang.

About this article

Cite this article

Zhang, X., Jiang, J., Zhang, X. et al. A data transmission algorithm for distributed computing system based on maximum flow. Cluster Comput 18, 1157–1169 (2015). https://doi.org/10.1007/s10586-015-0467-3

Received: 29 March 2014
Revised: 20 May 2015
Accepted: 01 June 2015
Published: 15 July 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10586-015-0467-3
