Extraction by Example: Induction of Structural Rules for the Analysis of Molecular Sequence Data from Heterogeneous Sources

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3578)

Abstract

Biological research requires information from multiple data sources that use a variety of database-specific formats. Manual gathering of information is time-consuming and error-prone, making automated data aggregation a compelling option for large studies. We describe a method for extracting information from diverse sources using structural rules specified by example. We developed a system for aggregation of biological knowledge (ABK) and used it to conduct an epidemiological study of dengue virus (DENV) sequences. Additional information on geographical origin and isolation date is critical for understanding evolutionary relationships, but these data are inconsistently structured in database entries. Using three public databases, we found that structural rules can be applied successfully even to inconsistently structured data distributed across multiple fields. High reusability, combined with the ability to integrate analysis tools, makes this method suitable for a wide variety of large-scale studies involving viral sequences.
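To make the idea of a structural rule specified by example more concrete, the following is a minimal sketch and not the ABK implementation, whose details are not given in this abstract. It assumes that records are available as XML, that a user points at one example value in one record, and that the rule is simply the element path leading to that value. The record layout, the tag names, and the functions `induce_rule` and `apply_rule` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def induce_rule(example_xml, example_value):
    """Return the tag path (relative to the root) of the first element
    whose text equals example_value. This path is the induced 'rule'."""
    root = ET.fromstring(example_xml)

    def walk(elem, path):
        # Depth-first search for an element whose text matches the example.
        if (elem.text or "").strip() == example_value:
            return path
        for child in elem:
            found = walk(child, path + [child.tag])
            if found is not None:
                return found
        return None

    path = walk(root, [])
    if path is None:
        raise ValueError("example value not found in example record")
    return "/".join(path) if path else "."


def apply_rule(record_xml, rule):
    """Apply an induced path rule to another record; None if it does not match."""
    node = ET.fromstring(record_xml).find(rule)
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Hypothetical records, invented for illustration only.
EXAMPLE = ("<entry><id>A1</id><source><country>Brazil</country>"
           "<year>1990</year></source></entry>")
OTHER = ("<entry><id>B2</id><source><country>Singapore</country>"
         "<year>2003</year></source></entry>")

rule = induce_rule(EXAMPLE, "Brazil")   # path learned from a single example
print(rule, apply_rule(OTHER, rule))
```

A single path rule of this kind only works when the target value always sits in the same element. The abstract notes that origin and date information is inconsistently structured and spread across multiple fields, so a practical system would presumably need several rules per property and some fallback order among them; that part is not sketched here.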

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Miotto, O., Tan, T.W., Brusic, V. (2005). Extraction by Example: Induction of Structural Rules for the Analysis of Molecular Sequence Data from Heterogeneous Sources. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_52

  • DOI: https://doi.org/10.1007/11508069_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26972-4

  • Online ISBN: 978-3-540-31693-0

  • eBook Packages: Computer Science, Computer Science (R0)
