Abstract
With the growth and spread of the Internet in China, Chinese webpage classification has become an important research topic. Because webpage content is a form of text, webpage classification builds on text classification. However, owing to the particular composition of webpages, externally linked webpages can supply helpful information that improves webpage classification performance. The goal of this work is to design accurate multi-label Chinese webpage classification models by effectively fusing the information extracted from the current webpage and its externally linked webpages, including the text and label information of those linked webpages. A convolutional neural network for webpage classification (PageCNN) and two variants (PageCNN-CLL and PageCNN-WLL) are proposed to fuse the text and label information extracted from multiple Chinese webpages. The proposed PageCNN models are compared with two modified traditional machine learning models, a modified TextCNN model, and three state-of-the-art deep-learning-based multi-label text classification models. The experimental results demonstrate that the PageCNN models outperform the compared models in terms of subset accuracy, Hamming loss, macro F1, and micro F1. Moreover, the effect of externally linked webpages on the classification of the current webpage is examined in depth by analyzing the error correction rate and hit rate of the proposed models and of the preliminary prediction variables. As the experiments demonstrate, the multi-information fusion methods developed in the PageCNN models make effective use of input data from multiple webpages to enhance multi-label Chinese webpage classification performance.
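The four evaluation measures named in the abstract have standard textbook definitions for multi-label data, and a small sketch may make them concrete. The code below is a minimal illustration, not the authors' evaluation code: the function names, the toy label matrices, and the convention that a label with no true or predicted positives scores an F1 of 0 are all assumptions made here.

```python
from typing import List, Tuple

# Assumed layout: one row per webpage, one 0/1 column per label.
Labels = List[List[int]]


def subset_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of samples whose whole predicted label set is exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def _counts(y_true: Labels, y_pred: Labels, j: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """F1 computed from counts pooled over all labels."""
    n_labels = len(y_true[0])
    per_label = [_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c) for c in zip(*per_label))
    return _f1(tp, fp, fn)


if __name__ == "__main__":
    # Toy data: 4 webpages, 3 labels (hypothetical values).
    y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
    print(subset_accuracy(y_true, y_pred))
    print(hamming_loss(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
    print(micro_f1(y_true, y_pred))
```

Subset accuracy is the strictest of the four, since a single wrong label on a webpage counts the whole prediction as a miss, whereas Hamming loss scores each label slot separately and is better when lower. Macro F1 weights every label equally, while micro F1 is dominated by frequent labels.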
Acknowledgements
This work was supported by Guangdong Natural Science Foundation (2021A1515012651), National Natural Science Foundation of China (62076100), Science and Technology Planning Project of Guangdong Province (2020B0101100002), Fundamental Research Funds for the Central Universities (SCUT) (x2rjD2220050), and CAAI-Huawei MindSpore Open Fund.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zheng, J., Chen, J., Cai, Y. (2024). PageCNNs: Convolutional Neural Networks for Multi-label Chinese Webpage Classification with Multi-information Fusion. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_14
DOI: https://doi.org/10.1007/978-981-97-2387-4_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2386-7
Online ISBN: 978-981-97-2387-4
eBook Packages: Computer Science, Computer Science (R0)