Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5518)

Abstract

Biomedical literature is an important source of information on chemical compounds. However, chemical entities are written in many different representations and nomenclatures, which makes references to them ambiguous. Many systems already exist for gene and protein entity recognition, but very few exist for chemical entities. The main reason is the lack of corpora for training named entity recognition systems and evaluating them.

In this paper we present a chemical entity recognizer that uses a machine learning approach based on conditional random fields (CRF), and we compare its performance with dictionary-based approaches that use several terminological resources. A gold standard of manually curated patent documents was used for training and evaluation. While the dictionary-based systems perform well at partial identification of chemical entities, the machine learning approach performs better at identifying complete entities, with a 10% increase in F-score over the best dictionary-based system.
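The distinction between partial and complete identification can be made concrete with a small sketch. The abstract does not describe the evaluation protocol, so everything below is an assumption for illustration: entities are represented as hypothetical `(start, end)` character offsets, "complete" is taken to mean identical boundaries, "partial" is taken to mean any overlap with a gold entity, and the example spans are invented rather than drawn from the paper's corpus.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed representation)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: List[Span], predicted: List[Span], partial: bool = False) -> float:
    """F-score of predicted spans against gold spans.

    partial=False: a prediction counts only if its boundaries equal a gold span.
    partial=True:  a prediction counts if it overlaps any gold span (assumed criterion).
    """
    if partial:
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(predicted)
        matched_pred = sum(p in gold_set for p in predicted)
        matched_gold = sum(g in pred_set for g in gold)

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: a recognizer that finds every entity but truncates two of them.
gold = [(0, 10), (20, 35), (50, 58)]
dictionary_pred = [(0, 10), (20, 28), (52, 58)]

exact_f = f_score(gold, dictionary_pred)                   # only (0, 10) has exact boundaries
partial_f = f_score(gold, dictionary_pred, partial=True)   # all three overlap a gold span
```

Under these assumptions, a system that locates entities but gets their boundaries wrong scores well on the partial criterion and poorly on the complete one, which is the kind of gap the abstract reports for the dictionary-based systems.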

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grego, T., Pęzik, P., Couto, F.M., Rebholz-Schuhmann, D. (2009). Identification of Chemical Entities in Patent Documents. In: Omatu, S., et al. Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living. IWANN 2009. Lecture Notes in Computer Science, vol 5518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02481-8_144

  • DOI: https://doi.org/10.1007/978-3-642-02481-8_144

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02480-1

  • Online ISBN: 978-3-642-02481-8

  • eBook Packages: Computer Science, Computer Science (R0)
