Extraction by Example: Induction of Structural Rules for the Analysis of Molecular Sequence Data from Heterogeneous Sources

Miotto, Olivo; Tan, Tin Wee; Brusic, Vladimir

doi:10.1007/11508069_52

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3578)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

Biological research requires information from multiple data sources that use a variety of database-specific formats. Manual gathering of information is time-consuming and error-prone, making automated data aggregation a compelling option for large studies. We describe a method for extracting information from diverse sources that involves structural rules specified by example. We developed a system for aggregation of biological knowledge (ABK) and used it to conduct an epidemiological study of dengue virus (DENV) sequences. Additional information on geographical origin and isolation date is critical for understanding evolutionary relationships, but this data is inconsistently structured in database entries. Using three public databases, we found that structural rules can be used successfully even when applied to inconsistently structured data that is distributed across multiple fields. High reusability, combined with the ability to integrate analysis tools, makes this method suitable for a wide variety of large-scale studies involving viral sequences.
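
The idea of a structural rule specified by example can be illustrated with a minimal sketch. It is not the ABK implementation, which the abstract does not describe in detail. It assumes each database entry is available as XML, and the record layout (`entry`, `source`, `country`, `date`) and the helper names `induce_rule` and `apply_rule` are hypothetical, chosen only for illustration. The user points at a value in one example record, the element path leading to that value is recorded as the rule, and the same path is then applied to other records.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; the tag names are assumptions, not a real database schema.
EXAMPLE = "<entry><source><country>Brazil</country><date>1990</date></source></entry>"
OTHER = "<entry><source><country>Singapore</country><date>2003</date></source></entry>"


def induce_rule(example_xml, example_value):
    """Return the list of tags (root first) leading to the first element
    whose text equals example_value, or None if no element matches."""
    root = ET.fromstring(example_xml)

    def walk(elem, path):
        here = path + [elem.tag]
        if (elem.text or "").strip() == example_value:
            return here
        for child in elem:
            found = walk(child, here)
            if found:
                return found
        return None

    return walk(root, [])


def apply_rule(record_xml, rule):
    """Follow a previously induced tag path in another record and return
    the text found there, or None if the record lacks that structure."""
    root = ET.fromstring(record_xml)
    if not rule or root.tag != rule[0]:
        return None
    if len(rule) == 1:
        return (root.text or "").strip() or None
    # ElementTree paths are relative to the root, so the root tag is dropped.
    hit = root.find("/".join(rule[1:]))
    if hit is None or not hit.text:
        return None
    return hit.text.strip()


country_rule = induce_rule(EXAMPLE, "Brazil")
print(country_rule)
print(apply_rule(OTHER, country_rule))
```

A single fixed path like this only covers consistently structured records. Handling values that are spread across several fields, as the abstract describes, would need several rules per attribute and some way of choosing among their results, which this sketch leaves out.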

Author information

Authors and Affiliations

Institute of Systems Science, National University of Singapore, 25 Heng Mui Keng Terrace, Singapore, 119615
Olivo Miotto
Department of Biochemistry, Faculty of Medicine, National University of Singapore, 10 Medical Drive, Singapore, 117597
Olivo Miotto, Tin Wee Tan & Vladimir Brusic
Institute for Infocomm Research, Singapore, 21 Heng Mui Keng Terrace, Singapore, 119613
Olivo Miotto & Vladimir Brusic

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, University of Queensland, 4072, Australia
Marcus Gallagher
POB 30031, Pensacola, FL 32503-1031
James P. Hogan
Faculty of Information Technology, Queensland University of Technology, Box 2434, Q 4001, Brisbane, Australia
Frederic Maire

About this paper

Cite this paper

Miotto, O., Tan, T.W., Brusic, V. (2005). Extraction by Example: Induction of Structural Rules for the Analysis of Molecular Sequence Data from Heterogeneous Sources. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_52

DOI: https://doi.org/10.1007/11508069_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26972-4
Online ISBN: 978-3-540-31693-0
eBook Packages: Computer Science, Computer Science (R0)
