The Case for Workflow-Aware Storage: An Opportunity Study

Journal of Grid Computing

Abstract

This article evaluates the potential gains a workflow-aware storage system can bring. Two observations lead us to believe that such a storage system is crucial to efficiently supporting workflow-based applications. First, workflows generate irregular and application-dependent data access patterns. These patterns prevent existing generic storage systems from harnessing all optimization opportunities, as doing so often requires enabling conflicting optimizations or even conflicting design decisions at the storage system level. Second, most workflow runtime engines make suboptimal scheduling decisions because they lack the detailed data location information that is generally hidden by the storage system. This paper presents a limit study that evaluates the potential gains from building a workflow-aware storage system that supports per-file access optimizations and exposes data location. Our evaluation, using synthetic benchmarks and real applications, shows that a workflow-aware storage system can bring significant performance gains: up to 3x compared to a vanilla distributed storage system deployed on the same resources yet unaware of the possible file-level optimizations.

Author information

Corresponding author

Correspondence to H. Yang.

About this article

Cite this article

Costa, L.B., Yang, H., Vairavanathan, E. et al. The Case for Workflow-Aware Storage: An Opportunity Study. J Grid Computing 13, 95–113 (2015). https://doi.org/10.1007/s10723-014-9307-6

  • DOI: https://doi.org/10.1007/s10723-014-9307-6
