Abstract
Biological research requires information from multiple data sources that use a variety of database-specific formats. Gathering this information manually is time-consuming and error-prone, which makes automated data aggregation a compelling option for large studies. We describe a method for extracting information from diverse sources that relies on structural rules specified by example. We developed a system for aggregation of biological knowledge (ABK) and used it to conduct an epidemiological study of dengue virus (DENV) sequences. Additional information on geographical origin and isolation date is critical for understanding evolutionary relationships, but this information is inconsistently structured in database entries. Using three public databases, we found that structural rules can be applied successfully even to inconsistently structured data that is distributed across multiple fields. High reusability, combined with the ability to integrate analysis tools, makes this method suitable for a wide variety of large-scale studies involving viral sequences.
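As a rough illustration of what a structural rule specified by example could look like, the following Python sketch derives an element path from one sample record in which the desired value has been pointed out, then reuses that path on other records. It is not the ABK implementation: the XML record layout, the tag names, and the functions `induce_rule`, `apply_rule`, and `extract` are all hypothetical, and a real system would need far more robust rule induction.

```python
import xml.etree.ElementTree as ET


def induce_rule(record_xml, example_value):
    """Return the tag path of the first element whose text contains example_value.

    Hypothetical helper: the "rule" here is just an ElementTree path string.
    """
    root = ET.fromstring(record_xml)

    def walk(elem, path):
        # Depth-first search for an element holding the example value.
        if elem.text and example_value in elem.text:
            return path
        for child in elem:
            found = walk(child, path + "/" + child.tag)
            if found:
                return found
        return None

    return walk(root, ".")


def apply_rule(rule, record_xml):
    """Return the non-empty text of every element matched by the rule."""
    root = ET.fromstring(record_xml)
    return [e.text.strip() for e in root.findall(rule) if e.text and e.text.strip()]


def extract(rules, record_xml):
    """Try several rules in order; return the first non-empty result, else None.

    This mimics, very loosely, information that may sit in different fields.
    """
    for rule in rules:
        values = apply_rule(rule, record_xml)
        if values:
            return values[0]
    return None


# Made-up records with the country stored in two different places.
EXAMPLE_A = "<entry><id>X1</id><source><country>Brazil</country></source></entry>"
EXAMPLE_B = "<entry><id>X2</id><note>Thailand</note></entry>"
NEW_RECORD = "<entry><id>X3</id><note>Vietnam</note></entry>"

rules = [induce_rule(EXAMPLE_A, "Brazil"), induce_rule(EXAMPLE_B, "Thailand")]
country = extract(rules, NEW_RECORD)
```

In this toy setup each example contributes one path, and the list of paths is tried in order, so a record that stores the country in a free-text note is still handled once a single example of that layout has been supplied.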
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Miotto, O., Tan, T.W., Brusic, V. (2005). Extraction by Example: Induction of Structural Rules for the Analysis of Molecular Sequence Data from Heterogeneous Sources. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_52
DOI: https://doi.org/10.1007/11508069_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26972-4
Online ISBN: 978-3-540-31693-0
eBook Packages: Computer Science, Computer Science (R0)