Abstract
Scientific research is increasingly assisted by computer-based experiments. Such experiments are often composed of a vast number of loosely coupled computational tasks that are specified and automated as scientific workflows. This large scale is also characteristic of the data that flows within such “many-task” computations (MTC). Provenance information can record the behavior of such computational experiments via the lineage of process and data artifacts. However, work to date has focused on lineage data models, leaving unsolved the issues of recording and querying other aspects, such as domain-specific information about the experiments, MTC behavior given by resource consumption and failure information, or the impact of the environment on performance and accuracy. In this work we contribute MTCProv, a provenance query framework for many-task scientific computing that captures the runtime execution details of MTC workflow tasks on parallel and distributed systems, in addition to standard prospective and data derivation provenance. To help users query provenance data, we provide a high-level interface that hides relational query complexities. We evaluate MTCProv using an application in protein science and describe how important query patterns, such as correlations between provenance, runtime data, and scientific parameters, are simplified and expressed.
Acknowledgements
This work was supported in part by CAPES and CNPq, by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and by NSF under awards OCI-0944332 and OCI-1007115. We thank Swift users Aashish Adhikari, Andrey Rzhetsky, and Jon Monette for providing and running applications using MTCProv, and for helping us understand their provenance requirements.
Additional information
Communicated by: Judy Qiu and Dennis Gannon.
About this article
Cite this article
Gadelha, L.M.R., Wilde, M., Mattoso, M. et al. MTCProv: a practical provenance query framework for many-task scientific computing. Distrib Parallel Databases 30, 351–370 (2012). https://doi.org/10.1007/s10619-012-7104-4
DOI: https://doi.org/10.1007/s10619-012-7104-4