MTCProv: a practical provenance query framework for many-task scientific computing

Gadelha, Luiz M. R.; Wilde, Michael; Mattoso, Marta; Foster, Ian

doi:10.1007/s10619-012-7104-4

Published: 17 August 2012

Volume 30, pages 351–370, (2012)

Distributed and Parallel Databases

Luiz M. R. Gadelha Jr.^1,2,
Michael Wilde^3,4,
Marta Mattoso^1 &
Ian Foster^3,4,5

Abstract

Scientific research is increasingly assisted by computer-based experiments. Such experiments are often composed of a vast number of loosely coupled computational tasks that are specified and automated as scientific workflows. This large scale is also characteristic of the data that flows within such “many-task” computations (MTC). Provenance information can record the behavior of such computational experiments via the lineage of process and data artifacts. However, work to date has focused on lineage data models, leaving unsolved issues of recording and querying other aspects, such as domain-specific information about the experiments, MTC behavior given by resource consumption and failure information, or the impact of environment on performance and accuracy. In this work we contribute MTCProv, a provenance query framework for many-task scientific computing that captures the runtime execution details of MTC workflow tasks on parallel and distributed systems, in addition to standard prospective and data derivation provenance. To help users query provenance data, we provide a high-level interface that hides relational query complexities. We evaluate MTCProv using an application in protein science, and describe how important query patterns such as correlations between provenance, runtime data, and scientific parameters are simplified and expressed.

Acknowledgements

This work was supported in part by CAPES and CNPq, by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and by NSF under awards OCI-0944332 and OCI-1007115. We thank Swift users Aashish Adhikari, Andrey Rzhetsky, and Jon Monette for providing and running applications using MTCProv, and for helping us understand their provenance requirements.

Author information

Authors and Affiliations

Computer Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Luiz M. R. Gadelha Jr. & Marta Mattoso
National Laboratory for Scientific Computing, Petrópolis, Brazil
Luiz M. R. Gadelha Jr.
Mathematics and Computer Science Division, Argonne National Laboratory, Chicago, USA
Michael Wilde & Ian Foster
Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, USA
Michael Wilde & Ian Foster
Department of Computer Science, University of Chicago, Chicago, USA
Ian Foster

Corresponding author

Correspondence to Luiz M. R. Gadelha Jr.

Additional information

Communicated by: Judy Qiu and Dennis Gannon.

About this article

Cite this article

Gadelha, L.M.R., Wilde, M., Mattoso, M. et al. MTCProv: a practical provenance query framework for many-task scientific computing. Distrib Parallel Databases 30, 351–370 (2012). https://doi.org/10.1007/s10619-012-7104-4

Issue Date: October 2012
DOI: https://doi.org/10.1007/s10619-012-7104-4
