Abstract
Research has shown that topic-oriented words are often related to named entities and can be used for Named Entity Recognition (NER). Many studies have proposed measuring the topicality of words in terms of ‘informativeness’, based on the global distributional characteristics of words in a corpus. However, this study shows that there can be a large discrepancy between informativeness and topicality; empirically, informativeness-based features can damage the learning accuracy of NER. This paper proposes measuring the topicality of words based on local distributional features specific to individual documents, and proposes methods that transform topicality into gazetteer-like features for NER by binning. Evaluated on five datasets from three domains, the methods improve consistently over a baseline, by between 0.9 and 4.0 in F-measure, and always outperform methods that use informativeness measures.
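The idea of turning a document-local topicality score into a gazetteer-like feature by binning can be illustrated with a minimal sketch. The abstract does not specify the topicality measure or the binning scheme, so everything here is an assumption made for illustration: topicality is approximated by a word's raw frequency within a single document, words are ranked by that score, and the ranking is split into equal-sized bins. The function names `topicality_bins` and `token_features` and the label format `TOPIC_BIN_k` are hypothetical.

```python
from collections import Counter


def topicality_bins(tokens, num_bins=3):
    """Map each distinct word of ONE document to a bin label.

    Assumption: topicality is approximated by in-document frequency.
    Bin 0 holds the highest-scoring words, the last bin the lowest.
    """
    counts = Counter(t.lower() for t in tokens)
    # Rank distinct words by descending frequency; ties are broken
    # alphabetically so that the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    bins = {}
    for rank, word in enumerate(ranked):
        # Split the ranking into num_bins groups of (nearly) equal size.
        bin_index = rank * num_bins // len(ranked)
        bins[word] = f"TOPIC_BIN_{bin_index}"
    return bins


def token_features(tokens, num_bins=3):
    """Attach the document-local bin label to every token as a feature."""
    bins = topicality_bins(tokens, num_bins)
    return [{"word": t, "topic_bin": bins[t.lower()]} for t in tokens]


if __name__ == "__main__":
    doc = "the gene BRCA1 regulates the gene BRCA1 and the protein".split()
    for feature in token_features(doc):
        print(feature)
```

Each token then carries a categorical `topic_bin` feature that a sequence labeller could consume in the same way as a gazetteer membership flag. Because the scores are computed per document, the same word can fall into different bins in different documents, which is the contrast with corpus-level informativeness measures drawn in the abstract. A raw frequency score also ranks function words such as "the" highly, so a practical variant would presumably filter stopwords or use a more refined local measure.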
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Z., Cohn, T., Ciravegna, F. (2013). Topic-Oriented Words as Features for Named Entity Recognition. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_25
DOI: https://doi.org/10.1007/978-3-642-37247-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)