Abstract
A large genomics project involves many researchers and technicians performing dozens of tasks, which may be manual (e.g., performing laboratory experiments), computer-assisted (e.g., searching for genes in the GenBank database), or performed entirely automatically by the computer (e.g., sequence assembly). It has become apparent that managing such projects poses overwhelming problems, which may lead to results of lower or even unacceptable quality or to drastically increased project costs. In this paper, we present the design and an initial implementation of a distributed workflow system created to schedule and support activities in a genomics laboratory. The laboratory's activities focus on discovering protein-protein interactions in fungi, specifically Neurospora crassa. We present our approach to developing, adapting, and applying workflow technology in the genomics lab and illustrate it using one distinct part of a larger workflow for discovering protein-protein interactions. Novel features of our system include the ability to monitor the quality and timeliness of the results and, if necessary, to suggest and incorporate changes to the selected tasks and their scheduling.
Cite this article
Kochut, K., Arnold, J., Sheth, A. et al. IntelliGEN: A Distributed Workflow System for Discovering Protein-Protein Interactions. Distributed and Parallel Databases 13, 43–72 (2003). https://doi.org/10.1023/A:1021565722755