A Parallel Conditional Random Fields Model Based on Spark Computing Environment

Tang, Zhuo; Fu, Zhongming; Gong, Zherong; Li, Kenli; Li, Keqin

doi:10.1007/s10723-017-9404-4

Published: 04 July 2017

Volume 15, pages 323–342, (2017)

Journal of Grid Computing

209 Accesses
9 Citations

Abstract

As one of the best-known probabilistic graphical models in machine learning, conditional random fields (CRFs) can merge different types of features, encode known relationships between observations, and construct consistent interpretations; they have been widely applied in many areas of Natural Language Processing (NLP). With the rapid development of the internet and information systems, performance issues are certain to arise when traditional CRFs deal with massive data. This paper proposes SCRFs, a parallel optimization of CRFs based on Resilient Distributed Datasets (RDDs) in the Spark computing framework. SCRFs optimizes traditional CRFs in the following stages. First, all features are generated in parallel, and the intermediate data that will be used frequently are cached in memory to speed up the iterations. By removing the low-frequency features of the model, SCRFs can also prevent overfitting and thereby improve prediction quality. Second, some specific features are dynamically added in parallel to correct the model during training. To make prediction efficient, a max-sum algorithm is proposed that infers the most likely state sequence by extending the belief propagation algorithm. Finally, we implement SCRFs based on Spark 1.6.0 and evaluate its performance using two widely used benchmarks: Named Entity Recognition and Chinese Word Segmentation. Compared with traditional CRFs models running on the Hadoop and Spark platforms respectively, the experimental results show that SCRFs has clear advantages in model accuracy and iteration performance.
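To illustrate the kind of inference the abstract refers to, the following is a minimal sketch of max-sum decoding on a linear-chain model, working in the log domain. It is not the SCRFs implementation, which is not shown on this page; it is the textbook max-sum (Viterbi-style) recursion, and the function name `max_sum_decode` and the `emission` and `transition` score tables are hypothetical names chosen for this example.

```python
from typing import List, Sequence


def max_sum_decode(
    emission: Sequence[Sequence[float]],
    transition: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring state sequence for a linear chain.

    emission[t][s]   : log-score of state s at position t (assumed input)
    transition[p][s] : log-score of moving from state p to state s (assumed input)
    """
    if not emission:
        return []
    n_states = len(transition)

    # Forward pass: score[s] is the best total score of any path ending in s.
    score = list(emission[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emission)):
        new_score = []
        pointers = []
        for s in range(n_states):
            # Max-sum message: maximise (instead of sum) over the previous state.
            best_prev = max(
                range(n_states), key=lambda p: score[p] + transition[p][s]
            )
            new_score.append(
                score[best_prev] + transition[best_prev][s] + emission[t][s]
            )
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Backward pass: follow the stored argmax pointers from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy two-state example with made-up log-scores.
    emission = [[0.0, -1.0], [-2.0, 0.0], [-2.0, 0.0]]
    transition = [[0.0, -0.5], [-0.5, 0.0]]
    print(max_sum_decode(emission, transition))  # prints [0, 1, 1]
```

The recursion replaces the sum in the sum-product (belief propagation) messages with a max, which is why the result is a single best sequence rather than per-position marginals. The cost is linear in sequence length and quadratic in the number of states.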

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 61572176) and the National High-tech R&D Program of China (2015AA015305).

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, Changsha, 410082, China
Zhuo Tang, Zhongming Fu, Zherong Gong & Kenli Li
Department of Computer Science, State University of New York, New Paltz, New York, 12561, USA
Keqin Li

Corresponding author

Correspondence to Zhuo Tang.

About this article

Cite this article

Tang, Z., Fu, Z., Gong, Z. et al. A Parallel Conditional Random Fields Model Based on Spark Computing Environment. J Grid Computing 15, 323–342 (2017). https://doi.org/10.1007/s10723-017-9404-4

Received: 06 April 2017
Accepted: 14 June 2017
Published: 04 July 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10723-017-9404-4
