Abstract
This article evaluates the potential gains of a workflow-aware storage system. Two observations lead us to believe that such a storage system is crucial for efficiently supporting workflow-based applications. First, workflows generate irregular, application-dependent data access patterns. Existing generic storage systems cannot harness all the resulting optimization opportunities, because doing so often requires enabling conflicting optimizations, or even making conflicting design decisions, at the storage system level. Second, most workflow runtime engines make suboptimal scheduling decisions because they lack detailed data location information, which the storage system generally hides. We present a limit study that evaluates the potential gains from building a workflow-aware storage system that supports per-file access optimizations and exposes data location. Our evaluation, using synthetic benchmarks and real applications, shows that a workflow-aware storage system can deliver up to a 3x performance gain over a vanilla distributed storage system that is deployed on the same resources but is unaware of the possible file-level optimizations.
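To make the second observation concrete, the following is a minimal sketch of how a scheduler might use data location once the storage system exposes it. It is an illustration only, not the system evaluated in the article: the function `pick_node`, the `file_locations` and `file_sizes` mappings, and the node names are all hypothetical.

```python
from collections import defaultdict


def pick_node(task_inputs, file_locations, file_sizes, nodes):
    """Return the node that already stores the most input bytes for a task.

    All names here are hypothetical; `file_locations` stands in for the
    location information a workflow-aware storage system would expose.
    """
    local_bytes = defaultdict(int)
    for name in task_inputs:
        # A file may be replicated on several nodes; credit each of them.
        for node in file_locations.get(name, ()):
            local_bytes[node] += file_sizes[name]
    # max() keeps the first maximal element, so ties (including the case
    # where no node holds any input) resolve to the earliest node in `nodes`.
    return max(nodes, key=lambda n: local_bytes[n])


nodes = ["n1", "n2", "n3"]
file_locations = {"a.dat": ["n1"], "b.dat": ["n2"], "c.dat": ["n2", "n3"]}
file_sizes = {"a.dat": 100, "b.dat": 60, "c.dat": 50}

chosen = pick_node(["a.dat", "b.dat", "c.dat"], file_locations, file_sizes, nodes)
```

In this toy input, `n2` holds 110 of the 210 input bytes, so the task is placed there. A scheduler without location information would have no basis for preferring `n2` over the other nodes.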
Cite this article
Costa, L.B., Yang, H., Vairavanathan, E. et al. The Case for Workflow-Aware Storage: An Opportunity Study. J Grid Computing 13, 95–113 (2015). https://doi.org/10.1007/s10723-014-9307-6
DOI: https://doi.org/10.1007/s10723-014-9307-6