CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework

Cluster Computing

Abstract

With the rapid growth of the biomedical literature, model training time in biomedical named entity recognition increases sharply when dealing with large-scale training samples. Improving the efficiency of named entity recognition on biomedical big data has therefore become one of the key problems in biomedical text mining. To improve recognition performance and reduce training time, this paper proposes an optimized two-phase recognition method using conditional random fields (CRFs). In the first phase, the boundary of each named entity is detected to identify all real entities. In the second phase, the semantic class of each detected entity is labeled. To speed up training in both phases, we implement the model training process on a parallel optimization framework based on MapReduce. The training set is divided into several parts, and the iterations of the training algorithm are designed as map tasks that can be executed simultaneously in a cluster, where each map function computes the gradient vector component for one part of the training set. Our experiments show that the proposed method achieves high performance with short training time, which has important implications for current biological big data processing.
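The data-parallel gradient step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it runs in a single process rather than on a Hadoop cluster, and it swaps the linear-chain CRF gradient (which would need forward-backward inference) for the gradient of a simple log-linear classifier, which has the same "observed minus expected feature counts" form. All function names, the toy data, and the weights are hypothetical.

```python
import math
from functools import reduce


def example_gradient(weights, features, label, num_labels):
    """Gradient of log p(label | features) for a log-linear classifier.

    weights[y][j] is the weight of feature j for label y. This is a stand-in
    (assumption) for the per-sequence CRF gradient.
    """
    scores = [sum(w * x for w, x in zip(weights[y], features))
              for y in range(num_labels)]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Observed feature counts minus expected feature counts.
    return [[((1.0 if y == label else 0.0) - probs[y]) * x for x in features]
            for y in range(num_labels)]


def add_gradients(a, b):
    """Element-wise sum of two gradient matrices."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def map_partition(weights, partition, num_labels):
    """Map task: partial gradient over one part of the training set."""
    grads = [example_gradient(weights, x, y, num_labels) for x, y in partition]
    return reduce(add_gradients, grads)


def reduce_gradients(partials):
    """Reduce task: sum the partial gradients from all map tasks."""
    return reduce(add_gradients, partials)


def split(data, num_parts):
    """Divide the training set into non-empty parts (round-robin)."""
    parts = [data[i::num_parts] for i in range(num_parts)]
    return [p for p in parts if p]


# Hypothetical toy training set: (feature vector, label) pairs.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 1), ([0.5, 0.2], 0)]
weights = [[0.1, -0.2], [0.0, 0.3]]

# In a real cluster each map_partition call would run on a different node.
partials = [map_partition(weights, part, 2) for part in split(data, 2)]
parallel_grad = reduce_gradients(partials)
serial_grad = map_partition(weights, data, 2)
```

Because the log-likelihood is a sum over training examples, its gradient is also a sum, so `parallel_grad` matches `serial_grad` up to floating-point rounding. An optimizer would consume the reduced gradient once per iteration and broadcast the updated weights to the next round of map tasks.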

Acknowledgments

The authors are grateful to the three anonymous reviewers for their criticism and comments, which have helped to improve the presentation and quality of the paper. This work is supported by the Key Program of the National Natural Science Foundation of China (Grant No. 61133005) and the National Natural Science Foundation of China (Grant Nos. 61370095, 61432005).

Author information

Corresponding author

Correspondence to Zhuo Tang.

About this article

Cite this article

Tang, Z., Jiang, L., Yang, L. et al. CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework. Cluster Comput 18, 493–505 (2015). https://doi.org/10.1007/s10586-015-0426-z

  • DOI: https://doi.org/10.1007/s10586-015-0426-z
