Abstract
This paper examines the use of XML data management support in scientific data analysis workflows. We describe a software infrastructure that aims to address metadata management, data storage and management, and the execution of data analysis workflows on distributed storage and compute platforms. The system couples a distributed, filter-stream-based dataflow engine with a distributed XML-based data and metadata management system. We present experimental results from a biomedical image analysis use case in which digitized microscopy images are processed for feature segmentation.
Index Terms
- XML database support for distributed execution of data-intensive scientific workflows