Combining data-driven systems for improving Named Entity Recognition

doi:10.1016/j.datak.2006.06.014

Data & Knowledge Engineering

Volume 61, Issue 3, June 2007, Pages 449-466

https://doi.org/10.1016/j.datak.2006.06.014

Abstract

The increasing flow of digital information requires the extraction, filtering and classification of pertinent information from large volumes of text. All these tasks benefit greatly from a Named Entity Recognizer (NER) in the preprocessing stage. This paper proposes a completely automatic NER system. The NER task involves not only the identification of proper names (Named Entities) in natural language text, but also their classification into a set of predefined categories, such as names of persons, organizations (companies, government organizations, committees, etc.), locations (cities, countries, rivers, etc.) and miscellaneous (movie titles, sport events, etc.). Throughout the paper, we examine the differences between language models learned by different data-driven classifiers confronted with the same NLP task, as well as ways to exploit these differences to yield a higher accuracy than the best individual classifier. Three machine learning classifiers (Hidden Markov Model, Maximum Entropy and Memory-Based Learning) are trained on the same corpus in order to solve the NER task. After comparison, their outputs are combined using voting strategies. A comprehensive experimental evaluation of our system, together with a comparison with other systems, has been carried out within the framework of two specialized scientific competitions for NER, CoNLL-2002 and HAREM-2005. Finally, this paper describes the integration of our NER system into different NLP applications, specifically Geographic Information Retrieval and Conceptual Modelling.

Introduction

The vision of the information society as a global digital community is fast becoming a reality. Progress is being driven by innovation in business and technology, and by the convergence of computing, telecommunications and information systems. Access to knowledge resources in the information society is vital to both our professional and personal development. However, access alone is not enough. We need to be able to select, classify, assimilate, retrieve, filter and exploit this information in order to enrich our collective and individual knowledge and skills. This is a key area of application for language technologies. The approach taken in this area is to develop advanced applications characterized by more intuitive natural language interfaces and content-based information analysis, extraction and filtering. Natural Language Processing (NLP) is crucial in solving these tasks. In particular, Named Entity Recognition (NER) has emerged as an important preprocessing tool for many NLP applications, such as Information Extraction (IE) and Information Retrieval (IR), mainly because Named Entities (NEs) provide important cues for identifying relevant information in text.

One application that benefits greatly from NER is Question Answering: to answer questions like “Who is the president of the Spanish Government?”, it is useful to know that the expected answer is the name of a person and to consider only entities of this type as candidate answers. Accurate NE detection can also help IE systems to gain knowledge about where an event happened, who was involved in it, etc. Developing and populating ontologies is another task that NER can improve considerably. Even in automatic semantic role labelling, the extra information that an item is a person or an artifact could be crucial in determining which entity denotes the agent, depending on the type of the verb. In the area of Conceptual Modelling, a NER module is used to extract functional requirements from users’ documents written in natural language.

As can be seen, NER is crucial to the performance of various NLP applications. To give a better insight into this area, we next present a short history of the task, together with the different approaches adopted in the identification and classification of NEs.

The NER task was first introduced at the sixth Message Understanding Conference (MUC-6). The MUC conferences [1] were established in 1987 with the aim of evaluating Information Extraction systems. During MUC-6, a definition and description of the term Named Entity (NE) was introduced, together with measures for the accuracy of a system performing NER. NER involves processing a text and identifying certain occurrences of words or expressions as belonging to a particular category of Named Entities (NEs), such as names of persons, organizations and locations, or numeric expressions like money or percent expressions. In fact, NER has two main sub-problems. The first is NE detection (NED), that is, the identification of the portion of text that forms a NE (for example, “El Presidente Rodriguez Zapatero”¹). The second is NE classification (NEC), the process of assigning a category label to the identified span of text (the system would assign the tag person to the NE “El Presidente Rodriguez Zapatero”).
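The split between detection and classification can be illustrated with a minimal sketch. It assumes the BIO tagging scheme commonly used for CoNLL-style data, where a `B-` tag opens an entity, `I-` continues it and `O` marks tokens outside any entity; the function name `bio_to_entities`, the tag names `PER` and `LOC`, and the continuation of the example sentence ("visitó Alicante") are illustrative assumptions, not taken from the paper.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, category) pairs from BIO-tagged tokens.

    The B-/I- prefixes carry the detection decision (where an entity
    starts and ends); the suffix carries the classification decision.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with the same label extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical sentence built around the example NE from the text.
tokens = ["El", "Presidente", "Rodriguez", "Zapatero", "visitó", "Alicante", "."]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_entities(tokens, tags))
```

In this encoding a detection-only module would need to predict just `B`, `I` and `O`, leaving the category suffix to a separate classification step.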

Conferences such as MUC, CoNLL² or ACE³ proposed a limited set of categories (for example, person, organization, location, and numeric expressions like money or percent expressions), but there are many more possible labels ([2] proposed an extended Named Entity hierarchy which contains about 150 NE types). So far, the research community has not reached agreement on the relationship between the number of classes and the domain of application.

Normally, an automatic NE tagger follows one of two possible approaches: the approach that employs dictionaries and hand-made rules (knowledge-based systems like the one presented in Arevalo et al. [3]) or the approach based on supervised learning techniques. The difficulty of building accurate systems following the first approach has led many researchers to show increased interest in supervised learning techniques in recent years. Supervised learning has the advantage of acquiring linguistic knowledge automatically from a pool of correctly hand-annotated examples. Techniques of this type are therefore easy to adapt to other domains or languages, but they depend on an appropriate corpus. This is the main limitation of these approaches, because annotated data is not usually available. Supervised learning techniques employed so far in NER include: Decision Tree [4], Hidden Markov Model [5], [6], Maximum Entropy Model [7], Support Vector Machine [8], Boosting and voted perceptron [9], AdaBoost [10], Conditional Random Fields [11] and Finite State Automata [12].

In order to detect and classify the NEs, these NER systems typically use the surrounding context or syntactically annotated text. It is also common to use external resources such as lists of trigger words⁴ or gazetteers.⁵ In general, the larger the set of defined features, the more information a classifier has to perform its task. Problems that may arise with large feature sets include domain or language dependency and over-fitting.
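As a rough illustration of how such information sources can be turned into classifier input, the sketch below builds a feature dictionary for one token from its neighbouring words, a gazetteer lookup and a trigger-word lookup. The feature names, the one-word context window and the tiny example resources are all assumptions made for illustration; they are not the feature set of any system discussed here.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool]


def token_features(
    tokens: List[str], i: int, gazetteer: Set[str], triggers: Set[str]
) -> Dict[str, Feature]:
    """Build a feature dictionary for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    # Sentence boundaries are padded with explicit marker symbols.
    prev_word = tokens[i - 1].lower() if i > 0 else "<S>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>"
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": prev_word,              # surrounding context
        "next_word": next_word,
        "in_gazetteer": word in gazetteer,   # external list of known names
        "prev_is_trigger": prev_word in triggers,  # e.g. a title before a name
    }


# Hypothetical resources and sentence.
gazetteer = {"Alicante", "Madrid"}
triggers = {"presidente", "señor"}
tokens = ["El", "presidente", "Zapatero", "visitó", "Alicante"]

for key, value in token_features(tokens, 2, gazetteer, triggers).items():
    print(f"{key}: {value}")
```

Each additional lookup adds information, but resources such as the gazetteer and trigger list tie the features to one language and domain, which is the dependency problem noted above.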

In fact, NER suffers from the same limitation as many other NLP tasks: the knowledge acquisition bottleneck, which affects both corpus-based and knowledge-based systems. Knowledge-based systems rely on manually defined rules and dictionaries, which makes them difficult to tune to new domains or applications. Corpus-based approaches, on the other hand, depend on the availability of properly annotated corpora, even when only a restricted domain is needed. In both cases the high cost of acquiring such knowledge (explicit or in the form of sets of examples) is a serious handicap. Several attempts have been made to decrease the effort involved in the acquisition and application of knowledge for NER.

The Muse system [13] is a multi-purpose NER system which focuses on this matter. One of its properties is the ability to deal with heterogeneous sources of text. [14] proposed to use active learning techniques in order to minimize the human annotation effort by automatically selecting the most useful examples for training. For the NE classification task, [15] presented a pair of algorithms based on co-training [16] that use seven simple “seed” rules. This approach is based on natural redundancy in data because, according to the authors, both the spelling of a name and the context in which it appears are sufficient to determine its type. The three entity classes considered are location, person and organization. Other approaches based on the use of unlabeled data were presented by [17], [18]. They applied self-training techniques to solve the NE detection task and proposed a new voted co-training method for classifying the NEs. Compared to [15], this approach did not need a split of features, but rather different machine learning classification methods that are combined through voting. The NER was carried out for Spanish, and the identified NEs were classified into location, organization, person and miscellaneous categories.

Also by means of a bootstrapping technique, [19] studied the role of syntax-rich features such as constituency and dependency in order to iteratively feed a semi-supervised system based on Expectation Maximization (EM) with labeled and unlabeled data. Other approaches combine supervised and unsupervised methods with multilinguality, as [20] did for Catalan and Spanish using both AdaBoost.MH and the Greedy Agreement Algorithm [21]. This system takes advantage of the syntactic similarity between the two languages. Their conclusion is that, by using multilingual resources, one can clearly outperform other approaches.

A common approach for NER and other NLP tasks is to combine several methods and classifiers in order to improve performance. For example, [22] solved NER using a combination of four classifiers: robust linear classification, Maximum Entropy, transformation-based learning and Hidden Markov Model. [23] presented a system based on stacking and voting of strong classifiers, such as Support Vector Machine, boosting and memory-based learning.

Our system has been developed following this last approach. We combined several machine learning methods and linguistic knowledge resources. The main goals of the work presented in this paper are:

• to propose a completely automatic NER system which involves the identification of proper names in texts, and their classification into a set of predefined categories;
• to examine the differences between language models learned by different data-driven systems performing the same NLP task and how they can be exploited to yield a higher accuracy than the best individual system;
• to use voting strategies to combine effectively strong classifiers such as Hidden Markov Models, Maximum Entropy and Memory-based learning;
• to evaluate the proposed NER system in two specialized scientific competitions for NER, CoNLL-2002 and HAREM-2005;
• to utilize our NER system in an NLP application such as IR;
• to evaluate the applicability of our system in the area of Conceptual Modelling.

The paper is organized as follows: the system is described in Section 2; the experiments conducted and a discussion of the results obtained follow in Section 3; Section 4 describes the combination of the classifiers; Section 5 presents a comparison with other NER systems; and Section 6 covers the integration of the system within an IR application and a Conceptual Modelling application. Finally, we conclude in Section 7 with a summary of the most important achievements and plans for future work.

Section snippets

NERUA system description

In the following subsections, we describe the Named Entity Recognition system developed at the University of Alicante, called NERUA⁶ [24]. It is composed of two main modules, each of which corresponds to a NER subtask:

• entity detection – the identification of a sequence of words that makes up the name of an entity;
• entity classification – the assignment of a category (LOCation, ORGanization, PERson or MISCellaneous) to each detected

Experimental setup and discussion of results

The initial experiments were conducted for Spanish using the labelled training and test data of the CoNLL-2002 [31] competition. The training corpus contains 264 715 tokens, of which 18 794 are entities. The development corpus comprises 52 923 tokens, of which 4351 are entities, and the test corpus contains 51 533 tokens and 3558 entities. Scores were computed per NE class with the help of the conlleval¹⁰ script. The measures are $Precision = number of correct$
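For orientation, entity-level precision, recall and F1 as conventionally defined for CoNLL-style evaluation can be sketched as follows. The sketch assumes that an entity counts as correct only when its boundaries and its type both match the gold annotation exactly; the function name `precision_recall_f1`, the `(start, end, type)` span representation and the sample spans are hypothetical, and this is not the conlleval script itself.

```python
from typing import Set, Tuple

# An entity is represented as (start token index, end token index, type).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level scores (assumed conventional definitions)."""
    correct = len(gold & predicted)  # same boundaries and same type
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: two of four predictions match the gold standard.
gold = {(0, 4, "PER"), (5, 6, "LOC"), (8, 10, "ORG")}
predicted = {(0, 4, "PER"), (5, 6, "ORG"), (8, 10, "ORG"), (12, 13, "MISC")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Per-class scores follow by filtering both sets to a single entity type before calling the function.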

Classifier combination

In order to develop a system that outperforms the best individual classifiers, we combined the individual classifiers. According to [35], the simplest approach for classifier combination is voting. The output of various ML algorithms is examined and classifiers with weight exceeding a certain threshold are selected. Normally the weight is dependent upon the models that proposed a particular classification.
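A minimal sketch of weighted voting over per-token tags is given below. It assumes each classifier outputs one tag per token and contributes a fixed weight to the tag it proposes, with the highest-scoring tag winning; the classifier names used as keys, the weights and the sample predictions are illustrative assumptions, not the configuration or results reported in the paper.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_vote(predictions: Dict[str, List[str]], weights: Dict[str, float]) -> List[str]:
    """Combine per-token tag sequences from several classifiers by weighted voting."""
    names = list(predictions)
    length = len(predictions[names[0]])
    combined: List[str] = []
    for i in range(length):
        scores: Dict[str, float] = defaultdict(float)
        for name in names:
            # Each classifier adds its weight to the tag it proposes.
            scores[predictions[name][i]] += weights[name]
        # max() keeps the first maximal key, so ties are resolved in favour
        # of the tag proposed by the earliest-listed classifier.
        combined.append(max(scores, key=scores.get))
    return combined


# Hypothetical outputs of three classifiers for a four-token sentence.
predictions = {
    "HMM": ["B-PER", "I-PER", "O", "B-LOC"],
    "ME":  ["B-PER", "I-PER", "O", "B-ORG"],
    "MBL": ["B-ORG", "I-PER", "O", "B-ORG"],
}

equal = {"HMM": 1.0, "ME": 1.0, "MBL": 1.0}
biased = {"HMM": 3.0, "ME": 1.0, "MBL": 1.0}  # hypothetical weights

print(weighted_vote(predictions, equal))
print(weighted_vote(predictions, biased))
```

With equal weights this reduces to simple majority voting. Voting token by token can yield tag sequences that are not well-formed entities, so a practical combination would likely need a consistency check afterwards.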

It is possible to assign various weights to the classifiers, in effect giving more

System comparison

In this section, we describe NERUA’s performance compared to other existing systems that participated in the CoNLL-2002 and HAREM-2005 competitions.

Applications

In this section, we show the applicability of NERUA in real applications such as Information Retrieval (IR) and Conceptual Modelling (CM). For IR, NERUA has been applied to support the IR module in retrieving relevant documents, and this application has been tested at GeoCLEF 2005. As far as CM is concerned, NERUA has been used to extract functional requirements. In the following subsections these applications are presented in detail.

Conclusions and future work

In this paper, we presented a completely automatic Named Entity Recognition approach, based on three different machine learning techniques (Memory-based learning, Hidden Markov Models and Maximum Entropy). We examined the performance of each method individually for the NER task, and we studied the best combination among them using various voting strategies.

The system was initially developed for Spanish, but afterwards it was easily adapted to other languages, such as Portuguese and

Zornitsa Petrova Kozareva (1982) is a Ph.D. student in Computer Science at the University of Alicante, Spain. Since 2004 she has worked in the Department of Software and Computing Systems (GPLSI division) as a researcher with a fellowship from the Rector of Investigation at the University of Alicante. Her research interests are focused on Natural Language Processing and Machine Learning, in particular Named Entity Recognition and Textual Entailment. She has published 15 papers in International

References (43)

R. Grishman, B. Sundheim, Message understanding conference-6: a brief history, in: Proceedings of the 16th Conference...
S. Sekine, K. Sudo, C. Nobata, Extended named entity hierarchy, in: Proceedings of the Third International Conference...
M. Arevalo et al., Mice: a module for named entity recognition and classification, International Journal of Corpus Linguistics (2004)
S. Sekine, R. Grishman, H. Shinnou, A decision tree method for finding and classifying names in Japanese texts, in:...
D.M. Bikel et al., Nymble: a high-performance learning name-finder
G. Zhou, J. Su, Named entity recognition using an HMM-based chunk tagger, in: ACL ’02: Proceedings of the 40th Annual...
A. Borthwick, J. Sterling, E. Agichtein, R. Grishman, Exploiting diverse knowledge sources via maximum entropy in named...
M. Asahara, Y. Matsumoto, Japanese named entity extraction with redundant morphological analysis, in: NAACL ’03:...
M. Collins, Ranking algorithms for named-entity extraction: boosting and the voted perceptron, in: ACL ’02: Proceedings...
X. Carreras, L. Màrquez, L. Padró, Named entity extraction using AdaBoost, in: Proceedings of CoNLL-2002, Taipei,...
A. McCallum, W. Li, Early results for named entity recognition with conditional random fields, feature induction and...
M. Padró et al., A named entity recognition system based on a finite automata, Procesamiento del Lenguaje Natural (2005)
D. Maynard, V. Tablan, C. Ursu, H. Cunningham, Y. Wilks, Named entity recognition from diverse text types, in: R....
D. Shen, J. Zhang, J. Su, G. Zhou, C.L. Tan, Multi-criteria-based active learning for named entity recognition, in:...
M. Collins, Y. Singer, Unsupervised models for named entity classification, in: Proceedings of the Joint SIGDAT...
A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: COLT’ 98: Proceedings of the Eleventh...
Z. Kozareva, B. Bonev, A. Montoyo, Self-training and co-training applied to Spanish named entity recognition, in:...
Z. Kozareva, A. Montoyo, Learning Spanish named entities using unlabeled data, in: G. Angelova, K. Bontcheva, R....
B. Mohit, R. Hwa, Syntax-based semi-supervised named entity tagging, in: Proceedings of the ACL Interactive Poster and...
L. Márquez, A. de Gispert, X. Carreras, L. Padró, Low-cost named entity classification for Catalan: exploiting...
S. Abney, Bootstrapping, in: ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational...


Óscar Ferrández Escámez (1980) is a Ph.D. student in Computer Science at the University of Alicante. He has been working since 2004 in the Department of Software and Computing Systems (GPLSI division) at this University as a researcher with a fellowship from the Valencia Government. His research interests are focused on Computational Linguistics and Natural Language Processing, and more precisely on Named Entity Recognition and Textual Entailment. He has already published nine papers in International Conferences.

Andrés Montoyo is a full Professor and has been a member of the Department of Software and Computing Systems (GPLSI division) since 1992. His scientific interests include lexical and semantic analysis, word sense disambiguation, Named Entity Recognition, information extraction and information retrieval applications. He has been the main researcher in the development of different Word Sense Disambiguation systems (Specification Marks) and cross-language Geographic Information Retrieval (GIR) systems (GIRUA) that have been evaluated in several CLEF evaluations. He is editor of the special issue on Natural Language and Information Systems of the journal Data and Knowledge Engineering. He has been a member of the programme committees of several international conferences (NLDB, RANLP). He is responsible for several national research projects, has edited several books, and has contributed more than 40 papers to journals and conferences.

Rafael Muñoz (1967) has been working since 1996 in Software and Computing Systems at the University of Alicante as a researcher and lecturer. He is also the head of the Technology Transfer Office of the University of Alicante. His research areas are Computational Linguistics and Natural Language Processing, in particular Information Extraction and Retrieval, Question Answering, Temporal information applied to Question Answering, Named Entity Recognition and Anaphora resolution (definite description). He has chaired several conferences, notably NLDB’05. He has been a member of the Research Group on Language Processing and Information Systems at the University of Alicante since 1996 and is a member of the Natural Language Processing Spanish Society. He has written two books, edited three books and published more than 40 papers in journals and conferences, notably in the Computational Linguistics and DKE journals and at international conferences like ACL in 2004.

Armando Suárez (1965) has been working since 1992 in Software and Computing Systems at the University of Alicante as a researcher and lecturer. His research areas are Computational Linguistics and Natural Language Processing, in particular Word Sense Disambiguation and semantic annotation. He has been a member of the Research Group on Language Processing and Information Systems at the University of Alicante since 1996 and is a member of the Natural Language Processing Spanish Society. He has published more than 40 papers in journals and conferences, notably in the Journal of Artificial Intelligence Research and at international conferences like COLING in 2002.

Jaime Gómez is a Professor of Software Engineering at the Computer Science School at the University of Alicante. He received a B.S. in Computer Science from the Technical University of Valencia in 1992 and a Ph.D. in Computer Science from the University of Alicante in 1999. His current research interests include the application of NLP to Web Engineering, the Semantic Web and Model-Driven Engineering. For more than eight years, his research group has developed methods and tools to facilitate software development in industry. He is the author of more than 50 papers in journals and international conferences.

☆ This research has been funded by the Spanish Government under project CICyT number TIC2003-07158-C04-01, TIN2004-00779 and PROFIT number FIT-340100-2004-14, CESS-ECE number HUM2004-21127-E and by the Valencia Government under project numbers GV04B-276 and GV04B-268.
