Identification of Chemical Entities in Patent Documents

Grego, Tiago; Pęzik, Piotr; Couto, Francisco M.; Rebholz-Schuhmann, Dietrich

doi:10.1007/978-3-642-02481-8_144

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5518)

Included in the following conference series:

International Work-Conference on Artificial Neural Networks

Abstract

Biomedical literature is an important source of information on chemical compounds. However, chemical entities are written using many different representations and nomenclatures, which makes references to them ambiguous. Many systems already exist for gene and protein entity recognition, but very few exist for chemical entities. The main reason is the lack of corpora for training named entity recognition systems and evaluating them.

In this paper, we present a chemical entity recognizer that uses a machine learning approach based on conditional random fields (CRF), and we compare its performance with dictionary-based approaches that use several terminological resources. A gold standard of manually curated patent documents was used for training and evaluation. While the dictionary-based systems perform well in partial identification of chemical entities, the machine learning approach performs better when identifying complete entities (a 10% increase in F-score over the best dictionary-based system).
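The distinction between partial and complete identification can be illustrated with a minimal Python sketch. It assumes that entities are represented as character-offset spans, that a complete match requires identical boundaries, and that a partial match requires only some overlap with a gold-standard span. These matching criteria, the function names, and the example sentence are illustrative assumptions; the abstract does not specify the evaluation protocol actually used.

```python
from typing import Set, Tuple

# A span is (start, end) in character offsets, end exclusive (assumed convention).
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(gold: Set[Span], pred: Set[Span], partial: bool = False):
    """Return (precision, recall, F-score) for predicted spans against gold spans.

    partial=False: a prediction counts only if its boundaries equal a gold span.
    partial=True:  a prediction counts if it overlaps any gold span, and a gold
                   span counts as found if any prediction overlaps it.
    """
    if partial:
        correct_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        found_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        correct_pred = found_gold = len(gold & pred)

    precision = correct_pred / len(pred) if pred else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical sentence: "treated with acetic acid and sodium chloride"
gold = {(13, 24), (29, 44)}   # "acetic acid", "sodium chloride"
pred = {(13, 24), (36, 44)}   # "acetic acid", but only "chloride"

complete_scores = evaluate(gold, pred)               # exact boundaries required
partial_scores = evaluate(gold, pred, partial=True)  # any overlap accepted
```

In this toy case the truncated prediction "chloride" is rewarded under partial matching but not under complete matching, so a system can score well on the former while missing entity boundaries.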

Author information

Authors and Affiliations

Faculty of Sciences, University of Lisbon, Campo Grande, 1749-016, Lisboa, Portugal
Tiago Grego & Francisco M. Couto
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Piotr Pęzik & Dietrich Rebholz-Schuhmann

Editor information

Editors and Affiliations

Graduate School of Engineering, Osaka Prefecture University, Osaka, Japan
Sigeru Omatu
Department of Informatics / CCTC, University of Minho, Braga, Portugal
Miguel P. Rocha
MAmI Research Lab, University of Castilla-La Mancha, Ciudad Real, Spain
José Bravo
Department of Informatics, University of Vigo, Ourense, Spain
Florentino Fernández
Grupo de Investigación GICAP, Área de Lenguajes Higher Polytechnic School, Universidad de Burgos, Burgos, Spain
Emilio Corchado
Higher Polytechnic School, University of Burgos, Burgos, Spain
Andrés Bustillo
Department of Informatics, University of Salamanca, Salamanca, Spain
Juan M. Corchado

About this paper

Cite this paper

Grego, T., Pęzik, P., Couto, F.M., Rebholz-Schuhmann, D. (2009). Identification of Chemical Entities in Patent Documents. In: Omatu, S., et al. Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living. IWANN 2009. Lecture Notes in Computer Science, vol 5518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02481-8_144

DOI: https://doi.org/10.1007/978-3-642-02481-8_144
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02480-1
Online ISBN: 978-3-642-02481-8
eBook Packages: Computer Science, Computer Science (R0)
