Inducing Context Gazetteers from Encyclopedic Databases for Named Entity Recognition

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7818)

Abstract

Named entity recognition (NER) is a fundamental task for mining valuable information from unstructured and semi-structured texts. State-of-the-art NER models mostly employ a supervised machine learning approach that depends heavily on local contexts. However, recent research has demonstrated that non-local contexts at the sentence or document level can improve recognition performance. In this paper, we propose the use of a context gazetteer, a list of contexts with which entity names can co-occur, as new non-local context information. We build the context gazetteer from an encyclopedic database because manually annotated data are often too scarce to yield rich and sophisticated context patterns. In addition, we use dependency paths as sentence-level non-local context, which captures contexts that are more syntactically related to entity mentions than the linear context used in traditional NER. In our experiments, we build a context gazetteer of gene names and apply it to a biomedical NER task. High-confidence context patterns appear in various forms: some resemble predicate–argument structures, whereas others take unexpected forms. The experimental results show that the proposed model, using both entity and context gazetteers, improves both precision and recall over a strong baseline model, demonstrating the usefulness of the context gazetteer.
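
The idea of a context gazetteer built from dependency paths can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy parsed sentences, the dependency labels, the path format, the confidence score (the fraction of a pattern's occurrences that attach to a known gene name), and the thresholds `min_count` and `min_confidence` are all assumptions made for illustration.

```python
from collections import Counter

# Toy dependency-parsed sentences (hypothetical data).
# Each token is (word, head_index, dependency_label); head_index -1 marks the root.
SENTENCES = [
    [("BRCA1", 1, "nsubj"), ("regulates", -1, "root"), ("transcription", 1, "dobj")],
    [("TP53", 1, "nsubj"), ("regulates", -1, "root"), ("apoptosis", 1, "dobj")],
    [("Stress", 1, "nsubj"), ("regulates", -1, "root"), ("growth", 1, "dobj")],
    [("Expression", 3, "nsubj"), ("of", 0, "prep"), ("MYC", 1, "pobj"), ("increased", -1, "root")],
    [("Expression", 3, "nsubj"), ("of", 0, "prep"), ("EGFR", 1, "pobj"), ("decreased", -1, "root")],
]

# Stand-in for an entity gazetteer of gene names (assumed, not from the paper).
GENE_NAMES = {"BRCA1", "TP53", "MYC", "EGFR"}


def path_pattern(sentence, index, max_hops=2):
    """Describe the dependency path from a token up to `max_hops` ancestors."""
    parts = []
    current = index
    for _ in range(max_hops):
        _, head, label = sentence[current]
        if head < 0:  # reached the root, nothing further to climb
            break
        parts.append(f"-{label}->{sentence[head][0].lower()}")
        current = head
    return "".join(parts)


def build_context_gazetteer(sentences, entity_names, min_count=2, min_confidence=0.6):
    """Keep path patterns that co-occur with known entity names often enough."""
    with_entity = Counter()
    total = Counter()
    for sentence in sentences:
        for i, (word, _, _) in enumerate(sentence):
            pattern = path_pattern(sentence, i)
            if not pattern:  # the root token has no upward path
                continue
            total[pattern] += 1
            if word in entity_names:
                with_entity[pattern] += 1
    gazetteer = {}
    for pattern, count in with_entity.items():
        confidence = count / total[pattern]
        if count >= min_count and confidence >= min_confidence:
            gazetteer[pattern] = confidence
    return gazetteer


gazetteer = build_context_gazetteer(SENTENCES, GENE_NAMES)
for pattern, confidence in sorted(gazetteer.items()):
    print(f"{pattern}\t{confidence:.2f}")
```

In a tagger, a token whose own path pattern appears in `gazetteer` could then receive an extra feature alongside the usual entity-gazetteer features. How the paper scores and selects patterns is not stated in the abstract, so the scoring here should be read only as one plausible choice.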


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cho, HC., Okazaki, N., Inui, K. (2013). Inducing Context Gazetteers from Encyclopedic Databases for Named Entity Recognition. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7818. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37453-1_31

  • DOI: https://doi.org/10.1007/978-3-642-37453-1_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37452-4

  • Online ISBN: 978-3-642-37453-1

  • eBook Packages: Computer Science, Computer Science (R0)
