A sparse \({\varvec{L}}_{2}\)-regularized support vector machines for efficient natural language learning

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Linear kernel support vector machines (SVMs) using either the \(L_{1}\)-norm or the \(L_{2}\)-norm have emerged as an important and widely used classification algorithm for many applications such as text chunking, part-of-speech tagging, information retrieval, and dependency parsing. \(L_{2}\)-norm SVMs usually provide slightly better accuracy than \(L_{1}\)-SVMs in most tasks. However, \(L_{2}\)-norm SVMs produce many near-zero but nonzero feature weights, and computing these nonsignificant weights is highly time-consuming. In this paper, we present a cutting-weight algorithm that guides the optimization process of the \(L_{2}\)-SVMs toward a sparse solution. Before checking optimality, our method automatically discards a set of near-zero but nonzero feature weights. The final objective can then be reached by the remaining features and hypothesis once the objective function is satisfied. One characteristic of our cutting-weight algorithm is that it requires no change to the original learning objective. To verify this concept, we conduct experiments on three well-known benchmarks, i.e., CoNLL-2000 text chunking, SIGHAN-3 Chinese word segmentation, and Chinese word dependency parsing. Our method achieves feature parameter reduction rates of 1–10 times compared with the original \(L_{2}\)-SVMs, with slightly better accuracy and a lower training time cost. In terms of run-time efficiency, our method is reasonably faster than the original \(L_{2}\)-regularized SVMs; for example, our sparse \(L_{2}\)-SVM is 2.55 times faster than the original \(L_{2}\)-SVM at the same accuracy.
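
To make the cutting-weight idea concrete, the following is a minimal, illustrative Python sketch of a linear \(L_{2}\)-regularized SVM trained with a simple primal sub-gradient method that zeroes out near-zero feature weights after each epoch. It is not the authors' published algorithm: the per-example objective, pruning threshold, and learning-rate schedule are assumptions made only for illustration.

    import numpy as np

    def train_sparse_l2_svm(X, y, C=0.1, epochs=20, lr=0.1, prune_threshold=1e-3):
        """Illustrative sketch only: a primal sub-gradient trainer for a linear
        L2-regularized SVM that prunes near-zero feature weights after each
        epoch, in the spirit of the cutting-weight idea. The objective split,
        threshold, and step-size schedule are assumptions, not the paper's method.

        X : (n_samples, n_features) array of feature values
        y : (n_samples,) array of labels in {-1, +1}
        """
        n, d = X.shape
        w = np.zeros(d)
        rng = np.random.default_rng(0)
        for epoch in range(epochs):
            for i in rng.permutation(n):
                margin = y[i] * X[i].dot(w)
                # Sub-gradient of the simplified per-example objective
                #   0.5 * ||w||^2 / n + C * max(0, 1 - y_i * w . x_i)
                grad = w / n
                if margin < 1.0:
                    grad = grad - C * y[i] * X[i]
                w = w - lr * grad
            # Cutting-weight step: discard near-zero (but nonzero) weights so the
            # remaining model stays sparse before the next convergence check.
            w[np.abs(w) < prune_threshold] = 0.0
            lr *= 0.9  # simple decaying step size (an assumption)
        return w

In such a sketch, only the features whose weights survive the threshold contribute at prediction time, which is the source of the reduction in nonzero parameters and the faster run-time that the abstract reports for the actual method.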

Notes

  1. Similar to the IOB2 tagging scheme, the six tags S/B/BI/I/IE/E indicate the Single/Begin/SecondBegin/Interior/BeforeEnd/End position of a token within a chunk (see the illustrative sketch after these notes).

  2. http://lcg-www.uia.ac.be/conll2000/chunking/conlleval.txt.

  3. \(C=0.1\) for the CoNLL-2000 chunking and Chinese word dependency parsing tasks and \(C=1\) for the SIGHAN-3 datasets.

  4. The hyper-parameter settings are the same as in previous literature [28, 34].

  5. http://140.115.112.118/bcbb/CMM-CoNLL/sparseSVM.htm.
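
As an illustration of note 1, here is a small Python sketch that maps a chunk of a given length to its S/B/BI/I/IE/E tag sequence. The note does not spell out which tags are used for two- and three-token chunks, so those two cases are assumptions.

    def encode_chunk(length):
        """Map a chunk of `length` tokens to the S/B/BI/I/IE/E scheme of note 1.
        The choices for 2- and 3-token chunks are assumptions, since the note
        only defines the roles of the six tags."""
        if length == 1:
            return ["S"]
        if length == 2:
            return ["B", "E"]        # assumed: too short for the BI/IE tags
        if length == 3:
            return ["B", "BI", "E"]  # assumed: BI kept, IE dropped
        return ["B", "BI"] + ["I"] * (length - 4) + ["IE", "E"]

    # For example, a five-token chunk is tagged B, BI, I, IE, E.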

References

  1. Ando RK, Zhang T (2005) A high-performance semi-supervised learning method for text chunking. In: Proceedings of the annual meeting of the association of computational linguistics, pp 1–9

  2. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the annual ACM workshop on computational learning theory, pp 144–152

  3. Buchholz S, Marsi E (2006) CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of the conference on computational natural language learning, pp 149–164

  4. Collins M (2002) Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the conference on empirical methods in natural language processing, pp 1–8

  5. Daumé H, Marcu D (2005) Learning as search optimization: approximate large margin methods for structured prediction. In: Proceedings of international conference on machine learning, pp 169–176

  6. Dhir CS, Lee J, Lee SY (2012) Extraction of independent discriminant features for data with asymmetric distribution. J Knowl Inf Syst 30(2):359–375

  7. Druck G, McCallum A (2010) High-performance semi-supervised learning using discriminatively constrained generative models. In: Proceedings of the international conference on machine learning, pp 319–326

  8. Fan TK, Chang CH (2010) Sentiment-oriented contextual advertising. J Knowl Inf Syst 23(3):321–344

  9. Frommer A, Maaß P (1999) Fast CG-based methods for Tikhonov-Phillips regularization. J Sci Comput 20(5):1831–1850

  10. Gao J, Andrew G (2007) Scalable training of L1-regularized log-linear models. In: Proceedings of international conference on machine learning, pp 33–40

  11. Gao J, Andrew G, Johnson M, Toutanova K (2007) A comparative study of parameter estimation methods for statistical natural language processing. In: Proceedings of the annual meeting of the association of computational linguistics, pp 824–831

  12. Giménez J, Márquez L (2004) SVMTool: a general POS tagger generator based on support vector machines. In: Proceedings of the 4th international conference on language resources and evaluation, pp 43–46

  13. Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear SVM. In: Proceedings of international conference on machine learning, pp 408–415

  14. Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 217–226

  15. Joachims T, Finley T, Yu CN (2009) Cutting-plane training of structural SVMs. Mach Learn 77(1):27–59

  16. Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking recurring contexts using ensemble classifiers: an application to email filtering. J Knowl Inf Syst 22(3):371–391

  17. Keerthi SS, Sundararajan S, Chang KW, Hsieh CJ, Lin CJ (2008) A sequential dual method for large scale multi-class linear SVMs. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 408–416

  18. Keerthi SS, DeCoste D (2005) A modified finite Newton method for fast solution of large scale linear SVMs. J Mach Learn Res 6:341–361

  19. Kudo T, Matsumoto Y (2001) Chunking with support vector machines. In: Proceedings of the North American chapter of the association for computational linguistics on language technologies, pp 192–199

  20. Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of conference on empirical methods in natural language processing, pp 230–237

  21. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of international conference on machine learning, pp 282–289

  22. Lee YS, Wu YC (2007) A robust multilingual portable phrase chunking system. Expert Syst Appl 33(3):1–26

  23. Mangasarian OL, Musicant DR (2001) Lagrangian support vector machines. J Mach Learn Res 1:161–177

  24. Manning CD, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  25. Nivre J, Hall J, Kübler S, McDonald R, Nilsson J, Riedel S, Yuret D (2007) The CoNLL 2007 shared task on dependency parsing. In: Proceedings of the conference on computational natural language learning, pp 915–932

  26. Ng HT, Low JK (2004) Chinese part-of-speech tagging: one-at-a-time or all-at-once? Word-based or character-based? In: Proceedings of conference on empirical methods in natural language processing, pp 277–284

  27. Suzuki J, Fujino A, Isozaki H (2007) Semi-supervised structured output learning based on a hybrid generative and discriminative approach. In: Proceedings of the annual meeting of the association of computational linguistics, pp 791–800

  28. Suzuki J, Isozaki H (2008) Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: Proceedings of the annual meeting of the association of computational linguistics, pp 665–673

  29. Tsai RTH (2010) Chinese text segmentation: a hybrid approach using transductive learning and statistical association measures. Expert Syst Appl 37(5):3553–3560

  30. Tjong Kim Sang EF, Buchholz S (2000) Introduction to the CoNLL-2000 shared task: chunking. In: Proceedings of the conference on computational natural language learning, pp 127–132

  31. Wu YC, Lee YS, Yang JC (2008) Robust and efficient Chinese word dependency analysis with linear kernel support vector machines. In: Proceedings of international conference on computational linguistics poster session, pp 135–138

  32. Zhang Y, Clark S (2007) Chinese segmentation with a word-based perceptron algorithm. In: Proceedings of the annual meeting of the association of computational linguistics, pp 840–847

  33. Zhang T, Damerau F, Johnson DE (2002) Text chunking based on a generalization of Winnow. J Mach Learn Res 2:615–637

  34. Zhao H, Kit C (2007) Incorporating global information into supervised learning for Chinese word segmentation. In: Proceedings of the conference of the pacific association for computational linguistics, pp 66–74

Acknowledgments

The authors acknowledge support under NSC Grants NSC 101-2221-E-130-027- and NSC 101-2622-E-130-006-CC3.

Author information

Corresponding author

Correspondence to Yu-Chieh Wu.

About this article

Cite this article

Wu, YC. A sparse \({\varvec{L}}_{2}\)-regularized support vector machines for efficient natural language learning. Knowl Inf Syst 39, 305–328 (2014). https://doi.org/10.1007/s10115-013-0615-0
