Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields

Abstract

The triple jump extrapolation method is an effective approximation of Aitken’s acceleration that can accelerate the convergence of many algorithms for data mining, including EM and generalized iterative scaling (GIS). It has two variants: global and componentwise extrapolation. Empirical studies showed that neither dominates the other, but it was not known under what conditions one outperforms the other. In this paper, we investigate this problem and conclude that componentwise extrapolation is more effective when the Jacobian of the mapping is (block) diagonal. We derive two hints for determining block diagonality. The first is that, for a highly sparse data set, the Jacobian of the EM mapping for training a Bayesian network will be block diagonal. The second is that the block diagonality of the Jacobian of the GIS mapping for training a CRF is negatively correlated with the strength of feature dependencies. We empirically verify these hints with controlled and real-world data sets and show that they accurately predict which method will be superior. We also show that both global and componentwise extrapolation can provide substantial acceleration. In particular, when applied to training large-scale CRF models, the GIS variant accelerated by componentwise extrapolation not only outperforms its global counterpart, as our hint predicts, but can also compete with limited-memory BFGS (L-BFGS), the de facto standard for CRF training, in terms of both computational efficiency and F-scores. Although none of the above methods is as fast as stochastic gradient descent (SGD), SGD requires careful tuning, and the results in this paper provide a useful foundation for automating that tuning.
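To make the distinction between the two options concrete, the following is a minimal sketch of Aitken-style extrapolation applied to a generic fixed-point mapping M, standing in for one EM or GIS update. The norm-ratio rate estimate, the safeguards, and all function names are illustrative assumptions and do not reproduce the paper’s exact triple jump formulation.

```python
import numpy as np

def global_extrapolate(theta0, theta1, theta2, eps=1e-12):
    # Global option: one convergence rate estimated from whole-vector
    # differences; every parameter is jumped by the same geometric factor.
    d1, d2 = theta1 - theta0, theta2 - theta1
    rate = np.linalg.norm(d2) / max(np.linalg.norm(d1), eps)
    rate = min(rate, 0.999)  # keep the geometric-series jump finite
    return theta1 + d2 / (1.0 - rate)

def componentwise_extrapolate(theta0, theta1, theta2, eps=1e-12):
    # Componentwise option: classic Aitken delta-squared per coordinate,
    # which pays off when the mapping's Jacobian is (block) diagonal and
    # coordinates converge at different rates.
    d1, d2 = theta1 - theta0, theta2 - theta1
    denom = d2 - d1                      # theta2 - 2*theta1 + theta0
    out = theta2.copy()
    safe = np.abs(denom) > eps           # skip nearly converged coordinates
    out[safe] = theta0[safe] - d1[safe] ** 2 / denom[safe]
    return out

def accelerated_fixed_point(M, theta, extrapolate, n_rounds=50):
    # One acceleration round: two plain steps of the mapping M
    # (e.g. an EM or GIS update), then one extrapolation jump.
    for _ in range(n_rounds):
        theta1 = M(theta)
        theta2 = M(theta1)
        theta = extrapolate(theta, theta1, theta2)
    return theta
```

In an EM setting, M would map the current parameter vector to the parameters after one E-step and M-step; in practice the extrapolated point is usually accepted only if it does not decrease the likelihood, a safeguard omitted from this sketch.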

Author information

Corresponding author

Correspondence to Chun-Nan Hsu.

Additional information

Communicated by Charles Elkan.


About this article

Cite this article

Huang, HS., Yang, BH., Chang, YM. et al. Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields. Data Min Knowl Disc 19, 58–94 (2009). https://doi.org/10.1007/s10618-009-0128-3
