PageCNNs: Convolutional Neural Networks for Multi-label Chinese Webpage Classification with Multi-information Fusion

  • Conference paper
Web and Big Data (APWeb-WAIM 2023)

Abstract

With the growing popularity of the Internet in China, Chinese webpage classification has become an important research topic. Because webpage content is largely text, webpage classification builds on text classification. However, webpages have a distinctive composition: externally linked webpages can provide helpful information that improves classification performance. The goal of this work is to design accurate multi-label Chinese webpage classification models by effectively fusing the information extracted from the current webpage and its externally linked webpages, including the text information and label information of the externally linked webpages. A convolutional neural network for webpage classification (PageCNN) model and its two variants (PageCNN-CLL and PageCNN-WLL) are proposed to fuse the text and label information extracted from multiple Chinese webpages. The proposed PageCNN models are compared with two modified traditional machine learning models, a modified TextCNN model, and three state-of-the-art deep-learning-based multi-label text classification models. The experimental results demonstrate that the PageCNN models outperform the compared models in terms of subset accuracy, Hamming loss, macro F1, and micro F1. Moreover, the effect of the externally linked webpages on classification of the current webpage is examined in depth by analyzing the error correction rate and hit rate of the proposed models and the preliminary prediction variables. As the experiments demonstrate, the multi-information fusion methods developed in the PageCNN models make effective use of the input data from multiple webpages to enhance multi-label Chinese webpage classification performance.
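The four evaluation metrics named in the abstract have standard definitions in the multi-label literature. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the function names `f1_from_counts` and `multilabel_metrics` and the toy label matrices are hypothetical and chosen only for illustration. Note that lower is better for Hamming loss, while higher is better for the other three.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); taken as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """Compute four standard multi-label metrics.

    y_true, y_pred: lists of equal-length 0/1 rows, one row per sample
    and one column per label (hypothetical input layout).
    """
    n_samples = len(y_true)
    n_labels = len(y_true[0])

    # Subset accuracy: a sample counts only if its whole label set is correct.
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))

    # Hamming loss: fraction of individual label decisions that are wrong.
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))

    # Per-label confusion counts (TP, FP, FN).
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro F1 averages per-label F1; micro F1 pools the counts first.
    macro_f1 = sum(f1_from_counts(*c) for c in counts) / n_labels
    micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts)))

    return {
        "subset_accuracy": exact / n_samples,
        "hamming_loss": wrong / (n_samples * n_labels),
        "macro_f1": macro_f1,
        "micro_f1": micro_f1,
    }


# Toy example: 4 webpages, 3 candidate labels (made-up data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
scores = multilabel_metrics(y_true, y_pred)
print(scores)
```

Macro F1 weights every label equally, so rare labels influence it as much as frequent ones, whereas micro F1 is dominated by the frequent labels. Reporting both, as the abstract does, gives a view of performance under each weighting.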

Acknowledgements

This work was supported by the Guangdong Natural Science Foundation (2021A1515012651), the National Natural Science Foundation of China (62076100), the Science and Technology Planning Project of Guangdong Province (2020B0101100002), the Fundamental Research Funds for the Central Universities (SCUT) (x2rjD2220050), and the CAAI-Huawei MindSpore Open Fund.

Author information

Corresponding author

Correspondence to Junying Chen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zheng, J., Chen, J., Cai, Y. (2024). PageCNNs: Convolutional Neural Networks for Multi-label Chinese Webpage Classification with Multi-information Fusion. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_14

  • DOI: https://doi.org/10.1007/978-981-97-2387-4_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2386-7

  • Online ISBN: 978-981-97-2387-4

  • eBook Packages: Computer Science, Computer Science (R0)
