Columba: Multidimensional Data Integration of Protein Annotations

Rother, Kristian; Müller, Heiko; Trissl, Silke; Koch, Ina; Steinke, Thomas; Preissner, Robert; Frömmel, Cornelius; Leser, Ulf

doi:10.1007/978-3-540-24745-6_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 2994)

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences

Abstract

We present COLUMBA, an integrated database of protein annotations. COLUMBA is centered on proteins whose structure has been resolved and adds as many annotations as possible to those proteins, describing properties such as function, sequence, classification, textual description, and participation in pathways. Annotations are extracted from seven (soon eleven) external data sources. In this paper we describe the motivation for building COLUMBA, its integration architecture, and the software tools we developed for integrating the data sources and keeping COLUMBA up to date. We put special focus on two aspects. First, COLUMBA does not try to remove redundancies and overlaps between data sources, but views each data source as a dimension in its own right describing a protein. We explain the advantages of this approach compared to the tighter semantic integration pursued in many other projects. Second, we highlight our current investigations into the quality of data in COLUMBA, which aim to identify hot spots of poor data quality.
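The idea of treating each data source as its own dimension can be illustrated with a small sketch. This is not the COLUMBA schema or software; the class name `ProteinEntry`, the helper `select`, the source labels, and the example values are all hypothetical and serve only to show annotations from overlapping sources being kept side by side rather than merged.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProteinEntry:
    """A resolved protein structure plus per-source annotations (illustrative only)."""
    structure_id: str  # hypothetical identifier, not a real database key
    # One slot per data source ("dimension"); records are stored as delivered.
    dimensions: Dict[str, List[dict]] = field(default_factory=dict)

    def annotate(self, source: str, record: dict) -> None:
        """Attach a record under its source without reconciling it with other sources."""
        self.dimensions.setdefault(source, []).append(record)

    def view(self, source: str) -> List[dict]:
        """Return all records contributed by one source (empty if none)."""
        return self.dimensions.get(source, [])


def select(entries: List[ProteinEntry], source: str, key: str, value) -> List[ProteinEntry]:
    """Filter entries along a single dimension: keep those with a matching record."""
    return [
        e for e in entries
        if any(r.get(key) == value for r in e.view(source))
    ]


# Two hypothetical classification sources that overlap in meaning.
entry = ProteinEntry("X1")
entry.annotate("classification_a", {"class": "all alpha"})      # made-up value
entry.annotate("classification_b", {"class": "mainly alpha"})   # made-up value

# Both records survive; a query chooses which dimension to filter on.
hits = select([entry], "classification_b", "class", "mainly alpha")
```

Because nothing is merged, a user can query by whichever source's vocabulary they trust, at the cost of having to know which source a given term comes from.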

Author information

Authors and Affiliations

Institut für Biochemie, Universitätskrankenhaus Charité Berlin, Monbijoustr. 2, 10098, Berlin, Germany
Kristian Rother, Robert Preissner & Cornelius Frömmel
Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Heiko Müller, Silke Trissl & Ulf Leser
Technische Fachhochschule Berlin, Seestr. 64, 13347, Berlin, Germany
Ina Koch
Zuse Institut Berlin, Takustrasse 7, Berlin, Germany
Thomas Steinke

Editor information

Editors and Affiliations

Dept. of Computer Science, University of Leipzig,
Erhard Rahm

About this paper

Cite this paper

Rother, K. et al. (2004). Columba: Multidimensional Data Integration of Protein Annotations. In: Rahm, E. (eds) Data Integration in the Life Sciences. DILS 2004. Lecture Notes in Computer Science, vol 2994. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24745-6_11

DOI: https://doi.org/10.1007/978-3-540-24745-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21300-0
Online ISBN: 978-3-540-24745-6
eBook Packages: Springer Book Archive
