A data transmission algorithm for distributed computing system based on maximum flow

Cluster Computing

Abstract

Data skew can lead to load imbalance and longer computation time in a distributed computing system. To avoid data skew and reduce computation time, data must be transmitted to appropriate machines, but this can consume excessive network resources. Balancing computational resources against network resources is therefore a problem. In this paper, we introduce a computation model called the distributed two-phase model, in which the processing of a task is divided into two independent phases: data transmission and data computation. Given an upper bound on the relative computation time of the data computation phase, we present a novel algorithm that schedules data transmission to meet this bound while minimizing data transmission time and network bandwidth usage. The algorithm also reduces the number of nodes that participate in the data computation phase, which saves computational resources. Simulation results show that the occupied bandwidth can be reduced effectively (by about 70%) for large-scale data sets and large numbers of nodes. The algorithm is also shown to be applicable in the replication scenario.
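To make the maximum-flow idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a simplified model in which the bound on relative computation time is treated as a cap on the amount of data each node may hold after transmission, and each link has a fixed limit on how much data it may carry. The names `plan_transfers`, `max_flow`, `SRC`, and `SNK` are hypothetical, and the sketch only checks feasibility; it does not minimize transmission time or bandwidth.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    # Residual graph: copy forward capacities and add zero-capacity reverse edges.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a path with spare residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def plan_transfers(data, links, bound):
    """Check whether every node can end up holding at most `bound` units.

    data:  {node: units currently stored}           (assumed input format)
    links: {(u, v): max units that may go u -> v}   (assumed input format)
    Returns (feasible, {(u, v): net units moved}).
    """
    src, snk = "SRC", "SNK"  # hypothetical labels; must not clash with node names
    capacity = {src: {}, snk: {}}
    for node, amount in data.items():
        capacity.setdefault(node, {})
        if amount > bound:
            capacity[src][node] = amount - bound  # surplus that must leave
        elif amount < bound:
            capacity[node][snk] = bound - amount  # spare room on this node
    for (u, v), cap in links.items():
        capacity[u][v] = cap
    surplus = sum(capacity[src].values())
    moved, residual = max_flow(capacity, src, snk)
    transfers = {}
    for (u, v), cap in links.items():
        sent = cap - residual[u][v]  # net flow on this link
        if sent > 0:
            transfers[(u, v)] = sent
    # Feasible only if the flow carries away all surplus data.
    return moved == surplus, transfers


if __name__ == "__main__":
    data = {"A": 10, "B": 2, "C": 0}
    links = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 1}
    print(plan_transfers(data, links, bound=5))
```

Under these assumptions, a bound is achievable exactly when the maximum flow equals the total surplus. Lowering `bound` until the check fails gives a rough view of the trade-off between computation balance and the network capacity needed to reach it.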

Acknowledgments

This work was supported by the National 863 Project (2011AA040101) and was jointly funded by the Beijing Municipal Education Commission of the Scientific Research.

Author information

Corresponding author

Correspondence to Xiaotong Zhang.

About this article

Cite this article

Zhang, X., Jiang, J., Zhang, X. et al. A data transmission algorithm for distributed computing system based on maximum flow. Cluster Comput 18, 1157–1169 (2015). https://doi.org/10.1007/s10586-015-0467-3

  • DOI: https://doi.org/10.1007/s10586-015-0467-3
