Abstract
Computational Biology, or Bioinformatics, has been defined as the application of mathematical and Computer Science methods to solving problems in Molecular Biology that require large-scale data, computation, and analysis [26]. As expected, Molecular Biology databases play an essential role in Computational Biology research and development. This paper gives an introduction to current Molecular Biology databases, stressing data modeling, data acquisition, data retrieval, and the integration of Molecular Biology data from different sources. It is primarily intended for an audience of computer scientists with a limited background in Biology.
References
3D Hit Homepage: http://3dhit.bioinfo.pl/.
S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Publishers: San Francisco, 2000.
ACEDB Documentation Library, http://genome.cornell.edu/acedocs/.
S. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.
ASN.1 Standard. Web Site. http://asn1.elibel.tm.fr.
T.L. Bailey and C. Elkan, “Fitting a mixture model by expectation maximization to discover motifs in biopolymers,” in Proceedings of the 2nd International Conference on Intelligent Systems in Molecular Biology (ISMB'94), 1994, pp. 28–36.
T.L. Bailey and M. Gribskov, “Combining evidence using P-values: Application to sequence homology searches,” Bioinformatics, vol. 14, pp. 48–54, 1998.
A. Bairoch and R. Apweiler, “The SWISS-PROT database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.
P. Baker, A. Brass, S. Bechhofer, C. Goble, N. Paton, and R. Stevens, “Tambis--Transparent access to multiple bioinformatics information sources,” in Proceedings of the 6th International Conference on Intelligent Systems in Molecular Biology (ISMB'98), 1998, pp. 25–34.
P. Baker, C. Goble, S. Bechhofer, N. Paton, R. Stevens, and A. Brass, “An ontology for bioinformatics application,” Bioinformatics, vol. 15, no. 6, pp. 510–520, 1999.
W. Baker, A. van den Broek, E. Camon, P. Hingamp, P. Sterk, G. Stoesser, and M.A. Tuli, “The EMBL nucleotide sequence database,” Nucleic Acids Research, vol. 28, no. 1, pp. 19–23, 2000.
F. Bancilhon, C. Delobel, and P. Kanellakis, Building an Object-Oriented Database System: The Story of O2, Morgan Kaufmann: San Francisco, 1992.
W.C. Barker, J.S. Garavelli, Z. Hou, H. Huang, R.S. Ledley, P.B. McGarvey, H.-W. Mewes, B.C. Orcutt, F. Pfeiffer, A. Tsugita, C.R. Vinayaka, C. Xiao, L.-S.L. Yeh, and C. Wu, “Protein information resource: A community resource for expert annotation of protein data,” Nucleic Acids Research, vol. 29, pp. 29–32, 2001.
D. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, B.A. Rapp, and D.L. Wheeler, “GenBank,” Nucleic Acids Research, vol. 28, no. 1, pp. 15–18, 2000.
B. Bos, H. Wium Lie, C. Lilley, and I. Jacobs, “Cascading style sheets, level 2,” W3C Recommendation, 1998. http://www.w3.org/TR/REC-CSS2/.
S.H. Bryant, J.-F. Gibrat, and T. Madej, “Threading a database of protein cores,” Proteins, vol. 23, pp. 356–369, 1995.
P. Buneman, “Semistructured data,” in Tutorial Proceedings of the 16th ACM Symposium on Principles of Database Systems, 1997.
C. Burge and S. Karlin, “Prediction of complete gene structures in human genomic DNA,” Journal of Molecular Biology, vol. 268, pp. 78–94, 1997.
CBS Prediction Server: http://www.cbs.dtu.dk/services/.
D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu, “XQuery 1.0: An XML query language,” W3C Working Draft, 2001, http://www.w3.org/TR/xquery/.
I.-M. Chen and V. Markowitz, “An overview of the object protocol model (OPM) and the OPM data management tools,” Information Systems, vol. 20, no. 5, pp. 393–418, 1995.
Chime Homepage: http://www.mdlchime.com/chime/.
J. Clark and S. DeRose, “XML path language (XPath) Version 1.0,” W3C Recommendation, 1999, http://www.w3.org/TR/xpath.
P. Clote and R. Backofen, Computational Molecular Biology, an Introduction, John Wiley and Sons, Ltd.: Chichester, 2000.
ClustArray Homepage: http://www.cbs.dtu.dk/services/DNAarray/.
M. Clutter, “Hearing on computational biology,” Statement before the Subcommittee on Science, Technology and Space Committee on Commerce, Science, and Transportation, U.S. Senate, 1996. http://www.nsf.gov/od/lpa/congress/cluttes2.htm.
C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
J.A. Cuff and G.J. Barton, “Evaluation and improvement of multiple sequence methods for protein secondary structure prediction,” PROTEINS: Structure, Function and Genetics, vol. 34, pp. 508–519, 1999.
S.B. Davidson, C. Overton, V. Tannen, and L. Wong, “Biokleisli: A digital library for biomedical researchers,” International Journal on Digital Libraries, vol. 1, no. 1, pp. 36–53, 1997.
F. Davis, B. Kahle, H. Morris, J. Salem, T. Shen, R. Wang, J. Sui, and M. Grinbaum, “WAIS interface protocol prototype functional specification (Version 1.5),” Thinking Machines Corporation, April 1990.
A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg, “Alignment of whole genomes,” Nucleic Acids Research, vol. 27, no. 11, pp. 2369–2376, 1999.
U. Dengler, A.S. Siddiqui, and G.J. Barton, “Protein structural domains: Analysis of the 3Dee domains database,” Proteins, vol. 42, pp. 332–344, 2001.
C. Discala, X. Benigni, E. Barillot, and G. Vaysseix, “DBcat: A catalog of 500 biological databases,” Nucleic Acids Research, vol. 28, no. 1, pp. 8–9, 2000.
D.R. Dolk, “Model management and structured modeling: The role of an information resource dictionary system,” Communications of the ACM, vol. 31, no. 6, pp. 704–718, 1988.
R. Durbin and J. Thierry-Mieg, “The ACEDB genome database,” in Computational Methods in Genome Research, S. Suhai (Ed.), Plenum Press: New York, 1994.
S.R. Eddy, “Profile hidden Markov models,” Bioinformatics, vol. 14, pp. 755–763, 1998.
Entrez Online Documentation: http://www.ncbi.nlm.nih.gov/Database/index.html.
T. Etzold, A. Ulyanow, and P. Argos, “SRS: Information retrieval system for molecular biology data banks,” Methods in Enzymology, vol. 266, pp. 114–128, 1996.
D.V. Faulkner and J. Jurka, “Multiple aligned sequence editor (MASE),” Trends in Biochemical Sciences, vol. 13, no. 8, pp. 321–322, 1988.
J. Felsenstein, “PHYLIP--Phylogeny inference package (Version 3.2),” Cladistics, vol. 5, pp. 164–166, 1989.
W. Fujibuchi, S. Goto, H. Migimatsu, I. Uchiyama, A. Ogiwara, Y. Akiyama, and M. Kanehisa, “DBGET/LinkDB: An integrated database retrieval system,” in Pacific Symposium on Biocomputing (PSB'97), 1997, pp. 683–694.
M. Gardiner-Garden and M. Frommer, “CpG islands in vertebrate genomes,” Journal of Molecular Biology, vol. 196, pp. 261–282, 1987.
M.S. Gelfand, A.A. Mironov, and P.A. Pevzner, “Gene recognition via spliced sequence alignment,” in Proceedings of the National Academy of Science USA (PNAS), vol. 93, 1996, pp. 9061–9066.
GenBank Growth: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.
D. George, H.-W. Mewes, and H. Kihara, “A standardized format for sequence data exchange,” Protein Sequence Data Analysis, vol. 1, pp. 27–39, 1987.
D.R. Gilbert, D.R. Westhead, N. Nagano, and J.M. Thornton, “Motif-based searching in TOPS protein topology databases,” Bioinformatics, vol. 5, no. 4, pp. 317–326, 1999.
N. Guex and M.C. Peitsch, “SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling,” Electrophoresis, vol. 18, pp. 2714–2723, 1997.
A. Gupta, H.V. Jagadish, and I.S. Mumick, “Data integration using self-maintainable views,” in Proceedings of the International Conference on Extending Database Technology (EDBT), LNCS, vol. 1057, Springer Verlag, 1996, pp. 140–144.
D. Gusfield, “Efficient methods for multiple sequence alignment with guaranteed error bounds,” Bulletin of Mathematical Biology, vol. 55, pp. 141–154, 1993.
M. Hammer and D. McLeod, “Database description with SDM: A semantic database model,” ACM Transactions on Database Systems, vol. 6, no. 3, 1981.
HIV-MAP Homepage: http://hiv-web.lanl.gov/content/hiv-db/MAP/hivmap.html.
K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch, “The PROSITE database, its status in 1999,” Nucleic Acids Research, vol. 27, no. 1, pp. 215–219, 1999.
L. Holm and C. Sander, “Protein structure comparison by alignment of distance matrices,” Journal of Molecular Biology, vol. 233, pp. 123–138, 1993.
A.K. Jain and R.C. Dubes, Algorithms for clustering data, Prentice-Hall, 1988.
Jalview Homepage: http://circinus.ebi.ac.uk:6543/jalview/help.html.
F. Jeanmougin, J.D. Thompson, M. Gouy, D.G. Higgins, and T.J. Gibson, “Multiple sequence alignment with clustal X,” Trends in Biochemical Sciences, vol. 23, pp. 403–405, 1998.
T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A literature network of human genes for high-throughput analysis of gene expression,” Nature Genetics, vol. 28, no. 1, pp. 21–28, 2001.
D.T. Jones, “GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences,” Journal of Molecular Biology, vol. 287, pp. 797–815, 1999.
D.T. Jones, “Protein secondary structure prediction based on position-specific scoring matrices,” Journal of Molecular Biology, vol. 292, pp. 195–202, 1999.
D.T. Jones, W.R. Taylor, and J.M. Thornton, “A model recognition approach to the prediction of all-helical membrane protein structure and topology,” Biochemistry, vol. 33, pp. 3038–3049, 1994.
P. Karp, “A strategy for database interoperation,” Journal of Computational Biology, vol. 2, no. 4, pp. 573–586, 1995.
D.G. Kneller, F.E. Cohen, and R. Langridge, “Improvements in protein secondary structure prediction by an enhanced neural network,” Journal of Molecular Biology, vol. 214, pp. 171–182, 1990.
T. Kohonen, Self-Organization and Associative Memory, Springer Verlag: Berlin, 1984.
R. Koradi, M. Billeter, and K. Wüthrich, “MOLMOL: A program for display and analysis of macromolecular structures,” Journal of Molecular Graphics and Modelling, vol. 14, pp. 51–55, 1996.
A. Krogh, B. Larsson, G. von Heijne, and E.L. Sonnhammer, “Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes,” Journal of Molecular Biology, vol. 305, no. 3, pp. 567–580, 2001.
J. Kyte and R.F. Doolittle, “A simple method for displaying the hydropathic character of a protein,” Journal of Molecular Biology, vol. 157, no. 1, pp. 105–132, 1982.
L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian, “SchemaSQL: A language for interoperability in relational multidatabase systems,” in Proceedings of the 22nd International Conference on Very Large Databases (VLDB'96), 1996, pp. 239–250.
H. Lehväslaiho, M. Ashburner, and T. Etzold, “Unified access to mutation databases,” Trends in Genetics, vol. 14, no. 5, pp. 205–206, 1998.
S. Letovsky, R.W. Cottingham, C.J. Porter, and P.W.D. Li, “GDB: The human genome database,” Nucleic Acids Research, vol. 26, no. 1, pp. 94–99, 1998.
O. Lund, K. Frimand, J. Gorodkin, H. Bohr, J. Bohr, J. Hansen, and S. Brunak, “Protein distance constraints predicted by neural networks and probability density functions,” Protein Engineering, vol. 10, no. 11, pp. 1241–1248, 1997.
J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
V. Markowitz, I.-M. Chen, A. Kosky, and E. Szeto, “Facilities for exploring molecular biology databases on the web: A comparative study,” in Pacific Symposium on Biocomputing (PSB'97), 1997, pp. 256–267.
M.A. Marti-Renom, A. Stuart, A. Fiser, R. Sánchez, F. Melo, and A. Sali, “Comparative protein structure modeling of genes and genomes,” Annual Review Biophysics and Biomolecular Structures, vol. 29, pp. 291–325, 2000.
D.C. McArthur, “An extensible XML schema definition for automated exchange of protein data: PROXIML (PROtein eXtensIble Markup Language),” http://www.cse.ucsc.edu/ douglas/proximl/.
R. McEntire, P. Karp, N. Abernethy, D. Benton, G. Helt, M. DeJongh, R. Kent, A. Kosky, S. Lewis, D. Hodnett, E. Neumann, F. Olken, D. Pathak, P. Tarczy-Hornoch, L. Toldo, and T. Topaloglou, “An evaluation of ontology exchange languages for bioinformatics,” in Proceedings of the 8th International Conference on Intelligent Systems in Molecular Biology (ISMB'00), 2000, pp. 239–250.
C. Medigue, A. Viari, A. Henaut, and A. Danchin, “Colibri: A functional database for the Escherichia coli genome,” Microbiology and Molecular Biology Reviews, vol. 57, no. 3, pp. 623–654, 1992.
H.-W. Mewes, D. Frishman, C. Gruber, B. Geier, D. Haase, A. Kaps, K. Lemcke, G. Mannhaupt, F. Pfeiffer, C. Schüller, S. Stocker, and B. Weil, “MIPS: A database for genomes and protein sequences,” Nucleic Acids Research, vol. 28, no. 1, pp. 37–40, 2000.
MolScript Homepage: http://www.avatar.se/molscript/.
Motif Homepage: http://motif.genome.ad.jp/.
S.B. Needleman and C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, pp. 443–453, 1970.
Patscan Homepage: http://www-unix.mcs.anl.gov/compbio/PatScan/HTML/patscan.html.
W. Pearson and D. Lipman, “Improved tools for biological sequence comparison,” in Proceedings of the National Academy of Science USA (PNAS), vol. 85, pp. 2444–2448, 1988.
M.C. Peitsch, “ProMod and swiss-model: Internet-based tools for automated comparative protein modelling,” Biochemical Society Transactions, vol. 24, pp. 274–279, 1996.
G. Perrière, P. Bessières, and B. Labedan, “EMGLib: The enhanced microbial genomes library (Update 2000),” Nucleic Acids Research, vol. 28, no. 1, pp. 68–71, 2000.
Predator Homepage: http://www.embl-heidelberg.de/argos/predator/predator info.html.
D.S. Prestridge, “SIGNAL SCAN: A computer program that scans DNA sequences for eukaryotic transcriptional elements,” CABIOS, vol. 7, pp. 203–206, 1991.
ProFit Homepage: http://www.bioinf.org.uk/software/.
M. Prokop, J. Damborsky, and J. Koca, “TRITON: In Silico construction of protein mutants and prediction of their activities,” Bioinformatics, vol. 16, pp. 845–846, 2000.
Promoter Scan Homepage: http://bimas.dcrt.nih.gov/molbio/proscan/index.html.
Protein Structure Prediction Center: http://predictioncenter.llnl.gov/.
PubMed Database: http://www.ncbi.nlm.nih.gov/PubMed/.
Readseq Homepage: http://www.nih.go.jp/%7Ejun/cgi-bin/readseq.pl.
F. Rechenmann, “Knowledge bases and computational biology,” in Towards Very Large Knowledge Bases, N. Mars (Ed.), IOS Press, 1995, pp. 1–12.
I.T. Rombel, K.F. Sykes, S. Rayner, and S.A. Johnston, “ORF-FINDER: A vector for high-throughput gene identification,” Gene, vol. 282, nos. 1/2, pp. 33–41, 2002.
B. Rost, “Review: Protein secondary structure prediction continues to rise,” Journal of Structural Biology, vol. 134, nos. 2/3, pp. 204–218, 2001.
B. Rost and C. Sander, “Prediction of protein secondary structure at better than 70% accuracy,” Journal of Molecular Biology, vol. 232, pp. 584–599, 1993.
RPFOLD Homepage: http://www.imtech.res.in/raghava/rpfold/.
K.-U. Sattler, S. Conrad, and G. Saake, “Adding conflict resolution features to a query language for database federations,” Australian Journal of Information Systems, vol. 8, no. 1, pp. 116–125, 2000.
R. Sayle and E.J. Milner-White, “RasMol: Biomolecular graphics for all,” Trends in Biochemical Sciences, vol. 20, no. 9, p. 374, 1995.
S. Schwartz, Z. Zhang, K.A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller, “PipMaker: A web server for aligning two genomic DNA sequences,” Genome Research, vol. 10, no. 4, pp. 577–586, 2000.
SFgate Homepage: http://ls6-www.informatik.uni-dortmund.de/ir/projects/SFgate/#intro.
A.P. Sheth and J.A. Larson, “Federated database systems for managing distributed, heterogeneous, and automated databases,” ACM Computing Surveys, vol. 22, no. 3, pp. 183–196, 1990.
J. Shi, T.L. Blundell, and K. Mizuguchi, “FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties,” Journal of Molecular Biology, vol. 310, pp. 243–257, 2001.
A.S. Siddiqui, U. Dengler, and G.J. Barton, “3Dee: A database of protein structural domains,” Bioinformatics, vol. 17, pp. 200–201, 2001.
T.F. Smith and M.S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, pp. 195–197, 1981.
R.F. Smith, B.A. Wiese, M.K. Wojzynski, D.B. Davison, and K.C. Worley, “BCM search launcher--an integrated interface to molecular biology data base search and analysis services available on the world wide web,” Genome Research, vol. 6, no. 5, pp. 454–462, 1996.
S. Spaccapietra, C. Parent, and Y. Dupont, “Model independent assertions for integration of heterogeneous schemas,” VLDB Journal, vol. 1, no. 1, pp. 81–126, 1992.
SRS User Guide, 2000, /srs6/doc/srsuser.pdf.
S.A. Sullivan, L. Aravind, I. Makalowska, A.D. Baxevanis, and D. Landsman, “The histone database: A comprehensive WWW resource for histones and histone fold-containing proteins,” Nucleic Acids Research, vol. 28, no. 1, pp. 320–322, 2000.
R.M. Sweet and D. Eisenberg, “Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure,” Journal of Molecular Biology, vol. 171, no. 4, pp. 479–488, 1983.
Y. Tateno, S. Miyazaki, M. Ota, H. Sugawara, and T. Gojobori, “DNA data bank of Japan (DDBJ) in collaboration with mass sequencing teams,” Nucleic Acids Research, vol. 28, no. 1, pp. 24–26, 2000.
J.D. Thompson, D.G. Higgins, and T.J. Gibson, “CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, pp. 4673–4680, 1994.
S. Tsur, “Data mining in the bioinformatics domain,” in Proceedings of the 26th Conference on Very Large Databases (VLDB'00), 2000.
J. van Helden, A. Naim, R. Mancuso, M. Eldridge, L. Wernisch, D. Gilbert, and S.J. Wodak, “Representing and analysing molecular and cellular function in the computer,” Biological Chemistry, vol. 381, pp. 921–935, 2000.
J. Vilo, A. Brazma, I. Jonassen, A. Robinson, and E. Ukkonen, “Mining for putative regulatory elements in the yeast genome using gene expression data,” in Proceedings of the 8th International Conference on Intelligent Systems in Molecular Biology (ISMB'00), 2000, pp. 384–394.
G. von Heijne, “Membrane protein structure prediction: Hydrophobicity analysis and the 'Positive Inside' rule,” Journal of Molecular Biology, vol. 225, pp. 487–494, 1992.
A.C. Wallace, R.A. Laskowski, and J.M. Thornton, “LIGPLOT: A program to generate schematic diagrams of protein-ligand interactions,” Protein Engineering, vol. 8, pp. 127–134, 1995.
Y. Wang, L.Y. Geer, C. Chappey, J.A. Kans, and S.H. Bryant, “Cn3D: Sequence and structure views for entrez,” Trends in Biochemical Sciences, vol. 25, no. 6, pp. 300–302, 2000.
Wise2 Homepage: http://www.sanger.ac.uk/Software/Wise2/.
XEMBL Project: http://www.ebi.ac.uk/xembl/.
G. Xie, R. DeMarco, R. Blevins, and Y. Wang, “Storing biological sequence databases in relational form,” Bioinformatics, vol. 16, no. 2, pp. 288–289, 2000.
Y. Xu, R.J. Mural, and E.C. Uberbacher, “Inferring gene structures in genomic sequences using pattern recognition and expressed sequence tags,” in Proceedings of the 5th International Conference on Intelligent Systems in Molecular Biology (ISMB'97), 1997, pp. 344–353.
R. Zimmer and T. Lengauer, “Protein structure prediction,” in Bioinformatics--From Genomes to Drugs, T. Lengauer (Ed.), Vol. 1: Basic Technologies, Wiley-VCH, 2002.
Cite this article
Bry, F., Kröger, P. A Computational Biology Database Digest: Data, Data Analysis, and Data Management. Distributed and Parallel Databases 13, 7–42 (2003). https://doi.org/10.1023/A:1021540705916
DOI: https://doi.org/10.1023/A:1021540705916