Abstract
Conditional random fields (CRFs), a well-known class of probabilistic graphical models in machine learning, can combine different types of features, encode known relationships between observations, and construct consistent interpretations; they have been widely applied in many areas of natural language processing (NLP). With the rapid growth of the internet and information systems, traditional CRFs run into performance problems when handling the resulting massive data. This paper proposes SCRFs, a parallel optimization of CRFs based on Resilient Distributed Datasets (RDDs) in the Spark computing framework. SCRFs optimizes traditional CRFs in the following stages. First, all features are generated in parallel, and the frequently used intermediate data are cached in memory to speed up iteration. By removing low-frequency features from the model, SCRFs also reduces overfitting and improves prediction quality. Second, specific features are dynamically added in parallel to correct the model during training. For efficient prediction, a max-sum algorithm that extends the belief propagation algorithm is proposed to infer the most likely state sequence. Finally, we implement SCRFs on Spark 1.6.0 and evaluate its performance on two widely used benchmarks: Named Entity Recognition and Chinese Word Segmentation. Compared with traditional CRF models running on the Hadoop and Spark platforms, the experimental results show that SCRFs has clear advantages in model accuracy and iteration performance.
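To make the max-sum step concrete, the following is a minimal sketch of max-sum decoding on a linear chain, assuming log-space scores. It is the textbook recursion (equivalent to Viterbi decoding), not the parallel SCRFs implementation, which the abstract does not detail. The function name `max_sum_decode` and the inputs `unary` and `pairwise` are hypothetical names chosen for illustration.

```python
from typing import List


def max_sum_decode(unary: List[List[float]],
                   pairwise: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain model.

    unary[t][k]    : log-space score of label k at position t (assumed input)
    pairwise[j][k] : log-space score of moving from label j to label k
    """
    if not unary:
        return []
    n_labels = len(unary[0])
    # msg[k] holds the best score of any label prefix ending in label k.
    msg = list(unary[0])
    backptr: List[List[int]] = []
    for t in range(1, len(unary)):
        new_msg, ptr = [], []
        for k in range(n_labels):
            # Max-sum takes a max over predecessors where sum-product
            # belief propagation would take a sum.
            best_j = max(range(n_labels),
                         key=lambda j: msg[j] + pairwise[j][k])
            new_msg.append(msg[best_j] + pairwise[best_j][k] + unary[t][k])
            ptr.append(best_j)
        msg = new_msg
        backptr.append(ptr)
    # Trace back from the best final label to recover the full sequence.
    last = max(range(n_labels), key=lambda k: msg[k])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With all transition scores set to zero the decoder simply picks the best label at each position; strong self-transition scores pull the result toward a constant sequence.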
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 61572176) and the National High-tech R&D Program of China (2015AA015305).
Cite this article
Tang, Z., Fu, Z., Gong, Z. et al. A Parallel Conditional Random Fields Model Based on Spark Computing Environment. J Grid Computing 15, 323–342 (2017). https://doi.org/10.1007/s10723-017-9404-4
DOI: https://doi.org/10.1007/s10723-017-9404-4