Abstract
We present BIODQ, a model for estimating and managing the quality of biological data in genomics repositories. BIODQ uses our Quality Estimation Model (QEM), which is implemented as part of the Quality Management Architecture (QMA). The QEM consists of a set of quality dimensions and their quantitative measures. The QMA is a set of software components that integrate the QEM with existing genomics repositories. Our experimental evaluation is based on a research study conducted among biologists. The results show that the QEM dimensions and estimations are biologically relevant and useful for discriminating high-quality from low-quality data. We also present the most relevant capabilities of the QMA.
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martinez, A., Hammer, J., Ranka, S. (2008). BioDQ: Data Quality Estimation and Management for Genomics Databases. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2008. Lecture Notes in Computer Science, vol 4983. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79450-9_44
DOI: https://doi.org/10.1007/978-3-540-79450-9_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79449-3
Online ISBN: 978-3-540-79450-9
eBook Packages: Computer Science, Computer Science (R0)