Abstract
A major constraint on machine learning techniques for many information extraction problems is the need for a sufficient number of training examples, which are costly and laborious to prepare. Active learning techniques select informative instances from the unlabeled data and add them to the training set in such a way that the overall classification performance improves. In the random sampling approach, unlabeled data are selected for annotation at random, and this does not reliably yield the desired results. In contrast, active learning selects the useful data from a huge pool of unlabeled documents. The selection strategies in use often assign instances to incorrect classes. A classifier is uncertain between two classes when the test instance is located near the margin. We propose two methods for active learning and show that they improve performance. The first approach is based on a support vector machine (SVM), whereas the second is based on ensemble learning that combines the classification capabilities of two well-known classifiers, namely SVM and conditional random field (CRF). The motivation for using these classifiers is that they are orthogonal in nature, so a combination of them can produce better results. To show the efficacy of the proposed approaches we choose a crucial problem, named entity recognition (NER), in three languages: Bengali, Hindi and English. The approaches are also evaluated for NER in the biomedical domain. Evaluation results show that the proposed techniques yield considerable performance improvements.
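The idea of selecting instances near the margin can be illustrated with a minimal sketch. The abstract does not give the exact selection criteria, so everything here is an assumption made for illustration: the function names, the use of the absolute decision value as an uncertainty score, and the use of label disagreement between the two classifiers as a stand-in for the ensemble-based selection. It is not the authors' algorithm.

```python
from typing import List, Sequence


def select_near_margin(decision_values: Sequence[float], k: int) -> List[int]:
    """Return indices of the k instances whose decision value is closest to zero.

    Assumption: decision_values are signed distances from a binary classifier's
    separating hyperplane, so a small absolute value means high uncertainty.
    """
    ranked = sorted(range(len(decision_values)), key=lambda i: abs(decision_values[i]))
    return ranked[:k]


def select_by_disagreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> List[int]:
    """Return indices where two classifiers predict different labels.

    Assumption: disagreement is used here as a simple proxy for uncertainty
    in a two-classifier ensemble.
    """
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical decision values for five unlabeled tokens.
svm_scores = [2.1, -0.05, 0.4, -1.7, 0.1]
picked = select_near_margin(svm_scores, k=2)

# Hypothetical predictions from two classifiers on the same five tokens.
svm_labels = ["PER", "O", "LOC", "O", "ORG"]
crf_labels = ["PER", "LOC", "LOC", "O", "O"]
disputed = select_by_disagreement(svm_labels, crf_labels)

print(picked, disputed)
```

In an active learning loop, the selected indices would be sent for annotation, added to the training set, and the classifiers retrained before the next round of selection.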
Notes
Here by extraction we mean both recognition and classification.
We iterate the algorithm for more than 10 iterations as we observed performance improvement even in the 10th iteration.
Cite this article
Ekbal, A., Saha, S. & Sikdar, U.K. On active annotation for named entity recognition. Int. J. Mach. Learn. & Cyber. 7, 623–640 (2016). https://doi.org/10.1007/s13042-014-0275-8
DOI: https://doi.org/10.1007/s13042-014-0275-8