Hierarchical classification with a topic taxonomy via LDA

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

Large-scale hierarchical classification studies how to classify documents into a predefined taxonomy with thousands of categories. Because the category distribution over documents is skewed, that is, most categories have very few labeled documents, data sparseness in the rare categories leads to low classification performance. In this paper, we study web-page classification over the topic taxonomy of the DMOZ directory. For this hard task, we propose a hierarchical classification model based on latent Dirichlet allocation (LDA). We use LDA as a feature extraction technique: it extracts latent topics to reduce the effects of data sparseness, and we construct topic feature vectors associated with the corpus to train more robust classification models for rare categories. Experiments were conducted on a dataset of web pages from the Chinese Simplified branch of the DMOZ directory. The results show that our method improves performance on rare categories over hierarchical classification methods based on full-term and feature-word representations, and further improves performance over the whole topic taxonomy.
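The idea of replacing sparse term vectors with dense topic feature vectors can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lda_topic_features`, the hyperparameter values, the toy documents, and the choice of a plain collapsed Gibbs sampler (in the style of Griffiths and Steyvers) are all assumptions made for illustration. Each document is mapped to a vector of topic proportions whose length equals the number of topics rather than the vocabulary size.

```python
import random

def lda_topic_features(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic-proportion vector per document.

    docs is a list of token lists. alpha and beta are symmetric Dirichlet
    hyperparameters (illustrative values, not taken from the paper).
    """
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for doc in docs for w in doc}))}
    V = len(vocab)
    n_dk = [[0] * num_topics for _ in docs]        # topic counts per document
    n_kw = [[0] * V for _ in range(num_topics)]    # word counts per topic
    n_k = [0] * num_topics                         # total token count per topic
    z = []                                         # topic assignment of every token

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][vocab[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    # Gibbs sweeps: resample each token's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = vocab[w], z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed document-topic proportions serve as the feature vectors.
    return [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]

# Toy corpus (hypothetical): two sports pages and two cooking pages.
toy_docs = [
    "football match goal team league".split(),
    "team coach goal season match".split(),
    "recipe oven bake flour sugar".split(),
    "sugar butter recipe bake cake".split(),
]
features = lda_topic_features(toy_docs, num_topics=2)
```

In a hierarchical setting, vectors like `features` would be passed to the classifier trained at each node of the taxonomy in place of, or alongside, raw term features. A rare category with few labeled pages can then share statistical strength with other pages that load on the same topics.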

Notes

  1. http://www.dmoz.org/

  2. http://dir.yahoo.com/

  3. http://mallet.cs.umass.edu/

Author information

Corresponding author

Correspondence to Li He.

Additional information

This work was supported by the National High Technology Research and Development Program of China (No. 2010AA012505, 2011AA010702, 2012AA01A401 and 2012AA01A402), Chinese National Science Foundation (No. 60933005, 91124002, 61303265), National Technology Support Foundation (No. 2012BAH38B04) and National 242 Foundation (No. 2011A010).

About this article

Cite this article

He, L., Jia, Y., Ding, Z. et al. Hierarchical classification with a topic taxonomy via LDA. Int. J. Mach. Learn. & Cyber. 5, 491–497 (2014). https://doi.org/10.1007/s13042-013-0203-3

  • DOI: https://doi.org/10.1007/s13042-013-0203-3
