Abstract
Because genomic information is complex and large volumes of data are produced every day, the genomic information accessible on the web has become very difficult to integrate, which hinders the research process. Drawing on knowledge from the Data Quality field, and after a specific study of a set of genomic databases, we have found problems related to six Data Quality dimensions. The aim of this paper is to highlight the problems that bioinformaticians face when they integrate information from different genomic databases. Its contribution is to identify and characterize those problems in order to understand which ones hinder the research process and increase the time researchers lose on this task.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
León, A., Reyes, J., Burriel, V., Valverde, F. (2016). Data Quality Problems When Integrating Genomic Information. In: Link, S., Trujillo, J. (eds) Advances in Conceptual Modeling. ER 2016. Lecture Notes in Computer Science, vol 9975. Springer, Cham. https://doi.org/10.1007/978-3-319-47717-6_15
DOI: https://doi.org/10.1007/978-3-319-47717-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47716-9
Online ISBN: 978-3-319-47717-6
eBook Packages: Computer Science (R0)