Self-organized dynamic provisioning for big data

Cluster Computing

Abstract

The recent rapid expansion of datasets in big data problems has produced data sizes that exceed the processing capabilities of available distributed computing resources. In other words, we are producing more data than we can process. In addition, further analysis of a dataset's collective state may require duplicating, transferring, and distributing the data, increasing the scale of the problem. Orchestrating these steps in large-scale complex systems is non-trivial. One basic technique for minimizing the effects of data redistribution is to use dynamic resource provisioning environments. When the organization and structure of nodes are dynamic and eclectic, provisioning environments require up-to-date information about resource availability. Keeping the state of available resources fresh in centralized or hierarchical scheduling systems imposes network communication overhead. Centralization also introduces administrative barriers, limiting interoperability. One effective way to improve the extent of self-organization is to take feedback from the system. Based on this feedback, nodes can alter their behavior to respond better to changing characteristics of dynamic resource provisioning environments. In this article, we present a decentralized scheduling framework that takes feedback from the system and adjusts its behavior accordingly. The framework provides an enabling mechanism for self-organization, in which each cloud node adapts its behavior based on the feedback. Compared to the centralized resource provisioning solutions used in current cloud systems, this approach achieves comparable scheduling decisions with half the packet overhead. We show that, by taking advantage of spatial locality with dynamic provisioning and by making better scheduling decisions with our framework, the data processing overhead of big data problems can be reduced by at least 30% in general, and by up to 55% for particular resource distributions. This, in turn, results in efficient scheduling decisions that provision better resources for big data tasks.
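
As an illustration only, the feedback idea described above can be sketched as a node that widens or narrows how many peers it contacts depending on how stale its cached resource state turned out to be. This is not the article's algorithm: the `Node` class, the `fanout` knob, the `stale_fraction` signal, and every threshold below are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that tunes its own dissemination effort."""
    name: str
    fanout: int = 2          # peers contacted per round (assumed knob)
    min_fanout: int = 1
    max_fanout: int = 8
    # peer name -> (free_cpus, timestamp of the advertisement)
    view: dict = field(default_factory=dict)

    def record(self, peer: str, free_cpus: int, timestamp: int) -> None:
        """Keep only the freshest advertisement seen for each peer."""
        known = self.view.get(peer)
        if known is None or timestamp > known[1]:
            self.view[peer] = (free_cpus, timestamp)

    def adapt(self, stale_fraction: float, target: float = 0.2) -> None:
        """Feedback step: contact more peers when too much cached state
        was stale, fewer when the cache is comfortably fresh."""
        if stale_fraction > target:
            self.fanout = min(self.max_fanout, self.fanout + 1)
        elif stale_fraction < target / 2:
            self.fanout = max(self.min_fanout, self.fanout - 1)

    def pick(self, needed_cpus: int, now: int, max_age: int):
        """Local scheduling decision: among known peers that fit the task
        and are not too old, return the most recently advertised one."""
        fits = [(ts, peer) for peer, (cpus, ts) in self.view.items()
                if cpus >= needed_cpus and now - ts <= max_age]
        return max(fits)[1] if fits else None


node = Node("a")
node.record("b", free_cpus=4, timestamp=10)
node.record("c", free_cpus=8, timestamp=12)
chosen = node.pick(needed_cpus=4, now=15, max_age=10)
node.adapt(stale_fraction=0.5)   # half the cache was stale: raise fanout
```

Each node decides using only its local view and a locally observed feedback signal, so no central scheduler has to be kept up to date. How the real framework measures feedback and which parameters it adjusts are described in the full article, not in this sketch.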

Notes

  1. Except in the case where nodes self-organize into neighborhoods in a peer-to-peer fashion.

  2. When Freshness is used as the ranking criterion.

Corresponding author

Correspondence to D. Cenk Erdil.

Cite this article

Erdil, D.C. Self-organized dynamic provisioning for big data. Cluster Comput 20, 2749–2762 (2017). https://doi.org/10.1007/s10586-017-0822-7

  • DOI: https://doi.org/10.1007/s10586-017-0822-7
