
Utilizing external corpora through kernel function: application in biomedical named entity recognition

  • Regular Paper
Progress in Artificial Intelligence

Abstract

The performance of word-level sequence labelling tasks such as named entity recognition and part-of-speech tagging depends largely on the features chosen for the task. In general, however, representing a word and capturing its characteristics properly through a set of features is difficult. Moreover, external resources are often essential for building a high-performance system, but acquiring the required knowledge demands domain-specific processing and feature engineering. Kernel functions, used with a support vector machine, may offer a more efficient way to capture similarity between words using both the local context and external corpora. In this paper, we compute the similarity between words using their context information, syntactic information and occurrence statistics in external corpora. This similarity value is computed through a kernel function that combines two sub-kernels. The first captures global information through word co-occurrence statistics accumulated from a large corpus. The second captures local semantic information of the words through word-specific parse tree fragmentation. We test the proposed kernel on the JNLPBA 2004 Biomedical Named Entity Recognition and BioCreative II 2006 Gene Mention Recognition task datasets. In our experiments, we observe that the proposed method is effective on both datasets.
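
The idea of combining two sub-kernels can be illustrated with a small sketch. The abstract does not state how the sub-kernels are defined or combined, so everything below is an assumption made for illustration only: the weighted sum with weight `alpha`, the cosine normalisation, the representation of parse tree fragments as plain strings, and the toy counts are all hypothetical and are not taken from the paper. A non-negative weighted sum of two valid kernels is itself a valid kernel, which is why this form is a common choice.

```python
import math
from collections import Counter


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * vec_b[key] for key, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cooccurrence_kernel(word_a, word_b):
    """Global sub-kernel (assumed form): similarity of co-occurrence counts
    gathered from an external corpus."""
    return cosine(word_a["cooc"], word_b["cooc"])


def fragment_kernel(word_a, word_b):
    """Local sub-kernel (assumed form): normalised count of shared parse
    tree fragments, with fragments given as pre-extracted strings."""
    return cosine(word_a["frags"], word_b["frags"])


def composite_kernel(word_a, word_b, alpha=0.5):
    """Hypothetical combination: convex sum of the two sub-kernels."""
    return (alpha * cooccurrence_kernel(word_a, word_b)
            + (1.0 - alpha) * fragment_kernel(word_a, word_b))


# Toy, made-up data: counts and fragments are illustrative only.
il2 = {
    "cooc": Counter({"receptor": 3, "expression": 2, "gene": 1}),
    "frags": Counter({"(NP (NN *) (NN gene))": 1, "(NP (NN *))": 1}),
}
nfkb = {
    "cooc": Counter({"expression": 1, "gene": 2, "binding": 2}),
    "frags": Counter({"(NP (NN *))": 1, "(PP (IN of) (NP (NN *)))": 1}),
}

similarity = composite_kernel(il2, nfkb, alpha=0.5)
```

A matrix of such pairwise values over all training words could then be passed to a support vector machine that accepts a precomputed kernel; the choice of `alpha` would need to be tuned on held-out data.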

Notes

  1. http://www.nactem.ac.uk/tsujii/GENIA/tagger/.

Author information

Corresponding author

Correspondence to Sujan Kumar Saha.

About this article

Cite this article

Patra, R., Saha, S.K. Utilizing external corpora through kernel function: application in biomedical named entity recognition. Prog Artif Intell 9, 209–219 (2020). https://doi.org/10.1007/s13748-020-00208-0

  • DOI: https://doi.org/10.1007/s13748-020-00208-0