DOI: 10.1145/1453101.1453147

Experience in using a process language to define scientific workflow and generate dataset provenance

Published: 09 November 2008

Abstract

This paper describes our experiences in exploring the applicability of software engineering approaches to scientific data management problems. Specifically, this paper describes how process definition languages can be used to expedite production of scientific datasets as well as to generate documentation of their provenance. Our approach uses a process definition language that incorporates powerful semantics to encode scientific processes in the form of a Process Definition Graph (PDG). The paper describes how execution of the PDG-defined process can generate Dataset Derivation Graphs (DDGs), metadata that document how the scientific process developed each of its product datasets. The paper uses an example to show that scientific processes may be complex and to illustrate why some of the more powerful semantic features of the process definition language are useful in supporting clarity and conciseness in representing such processes. This work is similar in goals to work generally referred to as Scientific Workflow. The paper demonstrates the contribution that software engineering can make to this domain.
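
The relationship between a process definition and the derivation metadata produced by executing it can be illustrated with a small sketch. This is not the paper's process language or its PDG/DDG implementation; the `Step`, `DDG`, and `execute` names, the two-step process, and the sample data are all hypothetical, chosen only to show how running a defined process can record, for each product dataset, which step and which inputs produced it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One step of a toy process definition (hypothetical, not the paper's language)."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., object]


@dataclass
class DDG:
    """Toy derivation graph: maps each dataset to the step and inputs that produced it."""
    produced_by: Dict[str, Tuple[str, List[str]]] = field(default_factory=dict)

    def record(self, step: Step) -> None:
        self.produced_by[step.output] = (step.name, list(step.inputs))

    def lineage(self, dataset: str) -> List[str]:
        """Return the steps that contributed to `dataset`, upstream steps first."""
        if dataset not in self.produced_by:
            return []  # a raw input: no derivation was recorded for it
        step_name, inputs = self.produced_by[dataset]
        result: List[str] = []
        for name in inputs:
            for upstream in self.lineage(name):
                if upstream not in result:
                    result.append(upstream)
        result.append(step_name)
        return result


def execute(steps: List[Step], raw: Dict[str, object]) -> Tuple[Dict[str, object], DDG]:
    """Run the steps in order, recording a derivation entry for every output."""
    data = dict(raw)
    ddg = DDG()
    for step in steps:
        data[step.output] = step.run(*(data[n] for n in step.inputs))
        ddg.record(step)
    return data, ddg


# Hypothetical two-step process: drop missing readings, then average them.
steps = [
    Step("clean", ["raw_flow"], "clean_flow",
         lambda xs: [x for x in xs if x is not None]),
    Step("average", ["clean_flow"], "mean_flow",
         lambda xs: sum(xs) / len(xs)),
]
data, ddg = execute(steps, {"raw_flow": [2.0, None, 4.0]})
print(data["mean_flow"])         # 3.0
print(ddg.lineage("mean_flow"))  # ['clean', 'average']
```

In this sketch the derivation record is a by-product of execution rather than something the scientist writes separately, which is the general idea the abstract describes; a real process language would add the richer semantics (hierarchy, concurrency, exception handling) that this linear list of steps leaves out.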

Published In

SIGSOFT '08/FSE-16: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2008
369 pages
ISBN: 9781595939951
DOI: 10.1145/1453101

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. continuous process improvement
  2. data provenance
  3. scientific workflow

Qualifiers

  • Research-article

Conference

SIGSOFT '08/FSE-16

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%

Cited By

  • (2018) An Algorithm for Finding the Minimum Cost of Storing and Regenerating Datasets in Multiple Clouds. IEEE Transactions on Cloud Computing 6:2 (519-531). DOI: 10.1109/TCC.2015.2491920. Online publication date: 1-Apr-2018
  • (2015) Predictive Analytics for Business Processes in Service Management. Maximizing Management Performance and Quality with Service Analytics (366-403). DOI: 10.4018/978-1-4666-8496-6.ch013. Online publication date: 2015
  • (2015) Dynamic On-the-Fly Minimum Cost Benchmarking for Storing Generated Scientific Datasets in the Cloud. IEEE Transactions on Computers 64:10 (2781-2795). DOI: 10.1109/TC.2015.2389801. Online publication date: 1-Oct-2015
  • (2014) Insider Threat Identification by Process Analysis. Proceedings of the 2014 IEEE Security and Privacy Workshops (251-264). DOI: 10.1109/SPW.2014.40. Online publication date: 17-May-2014
  • (2014) On Formal Definition and Analysis of Formal Verification Processes. Specification, Algebra, and Software (35-52). DOI: 10.1007/978-3-642-54624-2_2. Online publication date: 2014
  • (2013) An Algorithm for Cost-Effectively Storing Scientific Datasets with Multiple Service Providers in the Cloud. Proceedings of the 2013 IEEE 9th International Conference on e-Science (285-292). DOI: 10.1109/eScience.2013.34. Online publication date: 22-Oct-2013
  • (2013) A Highly Practical Approach toward Achieving Minimum Data Sets Storage Cost in the Cloud. IEEE Transactions on Parallel and Distributed Systems 24:6 (1234-1244). DOI: 10.1109/TPDS.2013.20. Online publication date: 1-Jun-2013
  • (2013) Bibliography. Computation and Storage in the Cloud (109-113). DOI: 10.1016/B978-0-12-407767-6.00021-4. Online publication date: 2013
  • (2012) Cancer treatment planning. Proceedings of the 4th International Workshop on Software Engineering in Health Care (19-25). DOI: 10.5555/2667036.2667040. Online publication date: 4-Jun-2012
  • (2012) Cancer treatment planning: Formal methods to the rescue. 2012 4th International Workshop on Software Engineering in Health Care (SEHC) (19-25). DOI: 10.1109/SEHC.2012.6227014. Online publication date: Jun-2012
