Abstract
Steps in scientific workflows often generate collections of results, causing the data flowing through workflows to become increasingly nested. Because conventional workflow components (or actors) typically operate on simple or application-specific data types, additional actors are often required to manage these nested data collections. As a result, conventional workflows grow more complex as data becomes more nested. This paper describes a new paradigm for developing scientific workflows that transparently manages nested data collections. Collection-oriented workflows have a number of advantages over conventional approaches, including simpler workflow designs (e.g., requiring fewer actors and control-flow constructs) that are invariant under changes in data nesting. Our implementation within the Kepler scientific workflow system enables the explicit representation of collections and collection schemas, supports concurrent operation over collection contents via multi-level pipeline parallelism, and allows collection-aware actors to be composed readily from conventional actors.
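To make the idea of composing a collection-aware actor from a conventional one concrete, the following is a minimal Python sketch. It is not the Kepler API and not the paper's implementation: the names `collection_aware` and `sequence_length` are hypothetical, collections are modeled as plain in-memory lists, and simple recursion stands in for the pipelined processing the abstract describes. It only illustrates why a workflow built this way need not change when the nesting depth of its input changes.

```python
from typing import Any, Callable


def collection_aware(actor: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Lift a conventional actor (one item in, one item out) so that it
    accepts arbitrarily nested lists and preserves their structure.
    Assumption: a Python list stands in for a 'collection'."""
    def lifted(data: Any) -> Any:
        if isinstance(data, list):
            # A collection: recurse into its contents, keeping the nesting.
            return [lifted(item) for item in data]
        # A leaf data item: hand it to the conventional actor unchanged.
        return actor(data)
    return lifted


def sequence_length(seq: str) -> int:
    """Hypothetical conventional actor that knows nothing about collections."""
    return len(seq)


length_actor = collection_aware(sequence_length)

flat = ["ACGT", "AC"]
nested = [["ACGT", "AC"], [["A"]]]
print(length_actor(flat))    # [4, 2]
print(length_actor(nested))  # [[4, 2], [[1]]]
```

The same `length_actor` handles both inputs without any extra looping or unpacking steps, which is the sense in which the design is invariant under changes in data nesting.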
Work supported in part by SciDAC/SDM (DE-FC02-01ER25486), NSF/SEEK (DBI-0533368), and NSF/GEON (EAR-0225673).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
McPhillips, T., Bowers, S., Ludäscher, B. (2006). Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. In: Leser, U., Naumann, F., Eckman, B. (eds) Data Integration in the Life Sciences. DILS 2006. Lecture Notes in Computer Science, vol 4075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11799511_23
DOI: https://doi.org/10.1007/11799511_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36593-8
Online ISBN: 978-3-540-36595-2
eBook Packages: Computer Science, Computer Science (R0)