Abstract
It is difficult to apply machine learning to domains that lack labeled training data. Biomedical named entity recognition (NER) is one such domain, and it remains a challenging task because of its extraordinarily complex nomenclature. In this paper, we propose a semi-supervised method that trains conditional random field (CRF) models using generalized expectation (GE) criteria to address the biomedical NER problem. Instead of labeling instances, the proposed method labels features to obtain training data, which saves considerable labeling time. A Latent Dirichlet Allocation (LDA) model is used to choose the features for labeling. Experimental results show that the proposed method can dramatically improve the performance of biomedical NER by incorporating unlabeled data through feature labeling.
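As a rough illustration of how feature labeling can drive training, one common form of a GE term is the KL divergence between a target label distribution attached to a labeled feature and the model's expected label distribution over the unlabeled tokens where that feature fires. The sketch below assumes this form. The two-label set, the hand-written posteriors, and the function names are hypothetical, and in the actual method the posteriors would come from CRF marginals rather than fixed numbers.

```python
import math


def expected_label_distribution(token_posteriors, has_feature):
    """Average the per-token label posteriors over tokens where the feature fires."""
    selected = [p for p, fires in zip(token_posteriors, has_feature) if fires]
    if not selected:
        raise ValueError("feature never fires in the unlabeled data")
    n_labels = len(selected[0])
    return [sum(p[k] for p in selected) / len(selected) for k in range(n_labels)]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); entries with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(q, eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


def ge_penalty(target, token_posteriors, has_feature):
    """GE-style penalty for one labeled feature (assumed KL form, lower is better)."""
    model_expectation = expected_label_distribution(token_posteriors, has_feature)
    return kl_divergence(target, model_expectation)


# Hypothetical toy data: label order is [PROTEIN, OTHER].
# Each row is a model posterior for one unlabeled token.
posteriors = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
# Whether a hypothetical labeled feature (e.g. a word suffix) fires on each token.
fires = [True, True, False]

matching_target = [0.8, 0.2]     # agrees with the model's average on firing tokens
conflicting_target = [0.5, 0.5]  # disagrees with it

penalty_match = ge_penalty(matching_target, posteriors, fires)
penalty_conflict = ge_penalty(conflicting_target, posteriors, fires)
```

The penalty is close to zero when the model's expectation already matches the annotator's target distribution and grows as they diverge, so minimizing it over unlabeled data pushes the model toward the feature-level annotation without any token-level labels.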





Acknowledgment
This work is supported by the National Natural Science Foundation of China (60973076, 61073127), the Research Fund for the Doctoral Program of Higher Education of China (20102302120053), and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2010045).
Cite this article
Yao, L., Sun, C., Wu, Y. et al. Biomedical named entity recognition using generalized expectation criteria. Int. J. Mach. Learn. & Cyber. 2, 235–243 (2011). https://doi.org/10.1007/s13042-011-0022-3
DOI: https://doi.org/10.1007/s13042-011-0022-3