Data Quality Problems When Integrating Genomic Information

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9975)

Abstract

Because genomic information is complex and large volumes of data are produced every day, the genomic information accessible on the web has become very difficult to integrate, which hinders the research process. Drawing on knowledge from the Data Quality field, and after a specific study of a set of genomic databases, we have found problems related to six Data Quality dimensions. The aim of this paper is to highlight the problems that bioinformaticians face when they integrate information from different genomic databases. Its contribution is to identify and characterize those problems in order to understand which ones hinder the research process and increase the time researchers lose on this task.

Author information

Corresponding author

Correspondence to Ana León.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

León, A., Reyes, J., Burriel, V., Valverde, F. (2016). Data Quality Problems When Integrating Genomic Information. In: Link, S., Trujillo, J. (eds) Advances in Conceptual Modeling. ER 2016. Lecture Notes in Computer Science, vol 9975. Springer, Cham. https://doi.org/10.1007/978-3-319-47717-6_15

  • DOI: https://doi.org/10.1007/978-3-319-47717-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47716-9

  • Online ISBN: 978-3-319-47717-6

  • eBook Packages: Computer Science, Computer Science (R0)
