Research article · DOI: 10.1145/1835804.1835845 · KDD Conference Proceedings

Grafting-light: fast, incremental feature selection and structure learning of Markov random fields

Published: 25 July 2010

ABSTRACT

Feature selection is important for achieving good generalization in high-dimensional learning, and structure learning of Markov random fields (MRFs) can automatically discover the inherent structures underlying complex data. Both problems can be cast as solving an l1-norm regularized parameter estimation problem. The existing Grafting method avoids inference on dense graphs during structure learning by incrementally selecting new features. However, Grafting performs a greedy step that fully optimizes over the free parameters whenever new features are included. This greedy strategy is inefficient when parameter learning is itself non-trivial, as in MRFs, where parameter learning depends on an expensive subroutine to calculate gradients; the cost of computing these gradients is typically exponential in the size of the maximal cliques.
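For concreteness, the l1-norm regularized estimation problem referred to above can be written in the standard log-linear form (a sketch of the usual formulation; the paper's own notation may differ):

    \min_{\theta}\; -\sum_{n=1}^{N}\log p(\mathbf{x}_n;\theta) + \lambda\,\|\theta\|_1,
    \qquad
    p(\mathbf{x};\theta) = \frac{1}{Z(\theta)}\exp\!\Big(\sum_{c}\theta_c^{\top} f_c(\mathbf{x}_c)\Big),

    \nabla_{\theta_c}\big[-\log p(\mathbf{x}_n;\theta)\big]
      = \mathbb{E}_{p(\mathbf{x};\theta)}[f_c(\mathbf{x}_c)] - f_c(\mathbf{x}_{n,c}).

Evaluating the model expectation E_{p(x;θ)}[f_c] requires probabilistic inference in the MRF; this is the expensive gradient subroutine mentioned above, and its cost grows exponentially with the size of the largest clique.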

In this paper, we present a fast algorithm called Grafting-Light to solve the l1-norm regularized maximum likelihood estimation of MRFs for efficient feature selection and structure learning. Grafting-Light iteratively performs a single step of orthant-wise gradient descent over the free parameters and then selects new features. This lazy strategy is guaranteed to converge to the global optimum and can effectively select significant features. On both synthetic and real data sets, we show that Grafting-Light is much more efficient than Grafting for both feature selection and structure learning. Compared with the optimal batch method that directly optimizes over all features, Grafting-Light performs comparably for feature selection but is much more efficient and accurate for structure learning of MRFs.
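The difference between the greedy and lazy strategies can be sketched as follows (illustrative Python pseudocode based only on the description above, not the authors' exact algorithm; compute_gradient, select_features, orthant_wise_step, and converged are hypothetical helpers standing in for the paper's subroutines):

    # Schematic contrast of Grafting vs. Grafting-Light, following the abstract.
    # compute_gradient, select_features, orthant_wise_step, and converged are
    # hypothetical stand-ins for the expensive MRF subroutines.

    def grafting(data, lam):
        active, theta = set(), {}                       # selected features and their weights
        while True:
            grad = compute_gradient(theta, data)        # requires MRF inference
            new = select_features(grad, active, lam)    # features whose gradient violates the l1 condition
            if not new:
                return theta
            active |= new
            # Greedy step: fully re-optimize all free parameters before the
            # next round of feature selection (many gradient evaluations).
            while not converged(theta, data, lam):
                theta = orthant_wise_step(theta, compute_gradient(theta, data), lam)

    def grafting_light(data, lam):
        active, theta = set(), {}
        while True:
            grad = compute_gradient(theta, data)        # one inference call per iteration
            new = select_features(grad, active, lam)
            active |= new
            # Lazy step: a single orthant-wise gradient-descent update over the
            # free parameters, then return to feature selection.
            theta = orthant_wise_step(theta, grad, lam)
            if not new and converged(theta, data, lam):
                return theta

Both variants call the gradient subroutine once per outer iteration; the difference is that Grafting runs its inner loop to convergence after every feature addition, whereas Grafting-Light takes only one step, so far fewer inference calls are needed overall.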


Supplemental Material

kdd2010_zhu_glfi_01.mov (MOV, 118.2 MB)


Published in

KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
July 2010, 1240 pages
ISBN: 9781450300551
DOI: 10.1145/1835804
Copyright © 2010 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
