Abstract
Large-scale hierarchical classification is the problem of classifying documents into a predefined taxonomy with thousands of categories. Because the distribution of documents over categories is skewed, that is, most categories have very few labeled documents, data sparseness in the rare categories leads to low classification performance. In this paper, we study web-page classification over the topic taxonomy of the DMOZ directory. For this hard task, we propose a hierarchical classification model based on latent Dirichlet allocation (LDA). We use LDA as a feature extraction technique: it extracts latent topics to reduce the effects of data sparseness, and we construct topic feature vectors from the corpus to train more robust classification models for rare categories. Experiments were conducted on a dataset of web pages from the Chinese Simplified branch of the DMOZ directory. The results show that our method improves performance on rare categories over hierarchical classification methods based on full-term and feature-word representations, and further improves performance over the whole topic taxonomy.
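
To illustrate what a topic feature vector looks like, the following is a minimal sketch that maps tokenized documents to LDA topic proportions. It is not the implementation used in the paper: the inference method (collapsed Gibbs sampling), the function name `lda_topic_features`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions made for illustration.

```python
import random


def lda_topic_features(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Return one topic-proportion vector per document.

    Hypothetical helper: collapsed Gibbs sampling is assumed here as the
    LDA inference method; the abstract does not state which one is used.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Assign every token to a random topic to start.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions serve as the feature vectors.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return features


# Toy corpus (invented for illustration), already tokenized.
toy_docs = [
    ["football", "league", "goal", "team"],
    ["team", "goal", "match", "football"],
    ["stock", "market", "bank", "price"],
    ["bank", "price", "market", "loan"],
]
features = lda_topic_features(toy_docs, num_topics=2)
```

Each document is reduced from a sparse term vector to `num_topics` dense values, which is the property that may help when a category has only a handful of labeled pages. In a hierarchical setup, vectors like these would be the input to whatever classifier is trained at each node of the taxonomy.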






Additional information
This work was supported by the National High Technology Research and Development Program of China (No. 2010AA012505, 2011AA010702, 2012AA01A401 and 2012AA01A402), Chinese National Science Foundation (No. 60933005, 91124002, 61303265), National Technology Support Foundation (No. 2012BAH38B04) and National 242 Foundation (No. 2011A010).
About this article
Cite this article
He, L., Jia, Y., Ding, Z. et al. Hierarchical classification with a topic taxonomy via LDA. Int. J. Mach. Learn. & Cyber. 5, 491–497 (2014). https://doi.org/10.1007/s13042-013-0203-3
DOI: https://doi.org/10.1007/s13042-013-0203-3