
A Parallel Conditional Random Fields Model Based on Spark Computing Environment


Abstract

As one of the well-known probabilistic graphical models in machine learning, conditional random fields (CRFs) can merge different types of features, encode known relationships between observations, and construct consistent interpretations; they have been widely applied in many areas of Natural Language Processing (NLP). With the rapid development of the Internet and information systems, performance problems inevitably arise when traditional CRFs deal with such massive data. This paper proposes SCRFs, a parallel optimization of CRFs based on the Resilient Distributed Datasets (RDDs) of the Spark computing framework. SCRFs optimizes traditional CRFs in the following stages. First, all features are generated in parallel, and the intermediate data that will be used frequently are cached in memory to accelerate the iterations; by removing the model's low-frequency features, SCRFs also prevents overfitting and improves prediction quality. Second, specific features are dynamically added in parallel to correct the model during training. For efficient prediction, a max-sum algorithm that extends the belief propagation algorithm is proposed to infer the most likely state sequence. Finally, we implement SCRFs on Spark 1.6.0 and evaluate its performance on two widely used benchmarks: Named Entity Recognition and Chinese Word Segmentation. Compared with traditional CRFs models running on the Hadoop and Spark platforms respectively, the experimental results show that SCRFs has clear advantages in both model accuracy and iteration performance.
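To make the first stage concrete, the following Scala sketch (a minimal illustration under stated assumptions, not the authors' implementation) shows how feature counts can be generated in parallel over an RDD, cached in memory so they can be reused across training iterations, and then pruned of low-frequency features. The extractor extractFeatures and the threshold minCount are hypothetical names introduced here for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FeatureStage {
  // Hypothetical extractor: emits the feature strings fired by one
  // (token, label) sentence; real CRF templates would emit many more.
  def extractFeatures(sentence: Seq[(String, String)]): Seq[String] =
    sentence.map { case (word, label) => s"U00:$word/$label" }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SCRFs-features"))
    val minCount = 3 // assumed pruning threshold

    // Toy corpus: each sentence is a sequence of (token, label) pairs.
    val corpus: RDD[Seq[(String, String)]] = sc.parallelize(Seq(
      Seq(("IL-2", "B-protein"), ("gene", "I-protein")),
      Seq(("the", "O"), ("IL-2", "B-protein"))
    ))

    // Generate all features in parallel and cache the counts: they are
    // touched on every training iteration, so keeping them in memory
    // avoids recomputing the RDD lineage each time.
    val featureCounts = corpus
      .flatMap(extractFeatures)
      .map(f => (f, 1L))
      .reduceByKey(_ + _)
      .cache()

    // Drop low-frequency features to shrink the model and curb overfitting.
    val kept = featureCounts.filter { case (_, c) => c >= minCount }
    println(s"kept ${kept.count()} of ${featureCounts.count()} features")
    sc.stop()
  }
}
```
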
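For the prediction stage, note that on a linear-chain CRF the max-sum variant of belief propagation reduces to Viterbi-style dynamic programming in log space. The sketch below illustrates that reduction on precomputed log-potentials; the node and transition score arrays are assumed inputs, and this is not the paper's exact extension of the algorithm.

```scala
object MaxSumDecode {
  /** Max-sum decoding on a linear chain: sums of log-potentials replace
    * the products of sum-product belief propagation, and max replaces sum.
    *
    * @param node  node(t)(s)  = log-potential of state s at position t
    * @param trans trans(a)(b) = log-potential of the transition a -> b
    * @return the highest-scoring state sequence
    */
  def decode(node: Array[Array[Double]], trans: Array[Array[Double]]): Array[Int] = {
    val T = node.length                   // sequence length
    val S = node(0).length                // number of states
    val delta = Array.ofDim[Double](T, S) // best score of a path ending in (t, s)
    val back  = Array.ofDim[Int](T, S)    // argmax backpointers

    for (s <- 0 until S) delta(0)(s) = node(0)(s)
    for (t <- 1 until T; s <- 0 until S) {
      // Max-sum message from position t-1: maximize over predecessor states.
      val (best, argBest) =
        (0 until S).map(p => (delta(t - 1)(p) + trans(p)(s), p)).maxBy(_._1)
      delta(t)(s) = best + node(t)(s)
      back(t)(s)  = argBest
    }

    // Backtrack from the best final state to recover the full sequence.
    val path = new Array[Int](T)
    path(T - 1) = (0 until S).maxBy(s => delta(T - 1)(s))
    for (t <- T - 2 to 0 by -1) path(t) = back(t + 1)(path(t + 1))
    path
  }
}
```

Because each sentence is decoded independently, prediction over a corpus parallelizes naturally, e.g. by calling a decoder like this inside an RDD map.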



Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 61572176) and the National High-tech R&D Program of China (Grant No. 2015AA015305).

Author information

Corresponding author

Correspondence to Zhuo Tang.


About this article


Cite this article

Tang, Z., Fu, Z., Gong, Z. et al. A Parallel Conditional Random Fields Model Based on Spark Computing Environment. J Grid Computing 15, 323–342 (2017). https://doi.org/10.1007/s10723-017-9404-4

