Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields

Abstract

The triple jump extrapolation method is an effective approximation of Aitken’s acceleration that can accelerate the convergence of many algorithms for data mining, including EM and generalized iterative scaling (GIS). It has two variants: global and componentwise extrapolation. Empirical studies showed that neither dominates the other, but it was not known under what conditions one outperforms the other. In this paper, we investigate this problem and conclude that componentwise extrapolation is more effective when the Jacobian of the mapping is (block) diagonal. We derive two hints for determining block diagonality. The first is that, for a highly sparse data set, the Jacobian of the EM mapping for training a Bayesian network will be block diagonal. The second is that the block diagonality of the Jacobian of the GIS mapping for training a CRF is negatively correlated with the strength of feature dependencies. We empirically verify these hints with controlled and real-world data sets and show that they accurately predict which method will be superior. We also show that both global and componentwise extrapolation can provide substantial acceleration. In particular, when applied to training large-scale CRF models, the GIS variant accelerated by componentwise extrapolation not only outperforms its global counterpart, as our hint predicts, but can also compete with limited-memory BFGS (L-BFGS), the de facto standard for CRF training, in terms of both computational efficiency and F-scores. Although none of the above methods is as fast as stochastic gradient descent (SGD), SGD requires careful tuning, and the results in this paper provide a useful foundation for automating that tuning.
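To make the distinction between the two options concrete, the following is a minimal sketch of Aitken-style extrapolation applied to a generic fixed-point mapping M, standing in for one EM or GIS update. The norm-ratio rate estimate, the safeguards, and all function names are illustrative assumptions and do not reproduce the paper’s exact triple jump formulation.

```python
import numpy as np

def global_extrapolate(theta0, theta1, theta2, eps=1e-12):
    # Global option: one convergence rate estimated from whole-vector
    # differences; every parameter is jumped by the same geometric factor.
    d1, d2 = theta1 - theta0, theta2 - theta1
    rate = np.linalg.norm(d2) / max(np.linalg.norm(d1), eps)
    rate = min(rate, 0.999)  # keep the geometric-series jump finite
    return theta1 + d2 / (1.0 - rate)

def componentwise_extrapolate(theta0, theta1, theta2, eps=1e-12):
    # Componentwise option: classic Aitken delta-squared per coordinate,
    # which pays off when the mapping's Jacobian is (block) diagonal and
    # coordinates converge at different rates.
    d1, d2 = theta1 - theta0, theta2 - theta1
    denom = d2 - d1                      # theta2 - 2*theta1 + theta0
    out = theta2.copy()
    safe = np.abs(denom) > eps           # skip nearly converged coordinates
    out[safe] = theta0[safe] - d1[safe] ** 2 / denom[safe]
    return out

def accelerated_fixed_point(M, theta, extrapolate, n_rounds=50):
    # One acceleration round: two plain steps of the mapping M
    # (e.g. an EM or GIS update), then one extrapolation jump.
    for _ in range(n_rounds):
        theta1 = M(theta)
        theta2 = M(theta1)
        theta = extrapolate(theta, theta1, theta2)
    return theta
```

In an EM setting, M would map the current parameter vector to the parameters after one E-step and M-step; in practice the extrapolated point is usually accepted only if it does not decrease the likelihood, a safeguard omitted from this sketch.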

Author information

Corresponding author

Correspondence to Chun-Nan Hsu.

Additional information

Communicated by Charles Elkan.


About this article

Cite this article

Huang, HS., Yang, BH., Chang, YM. et al. Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields. Data Min Knowl Disc 19, 58–94 (2009). https://doi.org/10.1007/s10618-009-0128-3
