Text Chunking to Improve Website Classification

  • Conference paper
Optimization, Learning Algorithms and Applications (OL2A 2023)

Abstract

Website classification is a crucial task in applications such as web search, content filtering, and recommendation systems. Effectively categorizing long web pages based on their content is essential for providing accurate and personalized user experiences. Transformer-based models such as BERT and RoBERTa have significantly advanced the field of natural language processing. However, these models handle long sequences poorly: the quadratic complexity of self-attention imposes a fixed maximum input length. This paper presents a simple weighted stratified split approach (WSSA) to address the limitations of BERT and RoBERTa in processing long text sequences for website classification. WSSA consists of splitting web pages into smaller chunks and then generating a new training chunk dataset by a weighted stratified split that follows the distribution of the categories in the whole chunk dataset. This training chunk dataset is then used to train the models. Our approach improves the accuracy of BERT and RoBERTa, surpassing the performance of the Longformer and BigBird models. The proposed solution enables efficient processing and data augmentation, with reasonable fine-tuning times for BERT and RoBERTa. Inference times remain low, which makes these models practical for real-time website classification. The combination of WSSA with the index web page performs particularly well, highlighting its effectiveness in addressing the long text sequence limitation and improving transformer-based models for website classification.
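The two steps named in the abstract (chunking, then a stratified split of the chunks) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how chunks are formed or how the weights are computed, so the sketch assumes fixed-size word windows, assumes each chunk inherits the label of its page, and treats the weights simply as the category proportions of the whole chunk dataset. The function names `chunk_words`, `build_chunk_dataset`, and `stratified_split`, and the parameters `chunk_size` and `train_fraction`, are hypothetical.

```python
import random
from collections import defaultdict


def chunk_words(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def build_chunk_dataset(pages, chunk_size=200):
    """pages is an iterable of (text, label); each chunk inherits its page's label (assumption)."""
    return [(chunk, label)
            for text, label in pages
            for chunk in chunk_words(text, chunk_size)]


def stratified_split(chunks, train_fraction=0.8, seed=0):
    """Sample the same fraction of chunks from every category, so the train set
    keeps (up to rounding) the category distribution of the whole chunk dataset."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in chunks:
        by_label[item[1]].append(item)

    train, held_out = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        n_train = round(train_fraction * len(items))
        train.extend(items[:n_train])
        held_out.extend(items[n_train:])
    return train, held_out


if __name__ == "__main__":
    # Toy pages standing in for real web page text.
    pages = [("a " * 10, "news"), ("b " * 20, "shop")]
    chunks = build_chunk_dataset(pages, chunk_size=5)
    train, held_out = stratified_split(chunks, train_fraction=0.5, seed=1)
    print(len(chunks), len(train), len(held_out))
```

In practice the chunk size would more plausibly be set in model tokens rather than words, so that each chunk fits the input limit of BERT or RoBERTa; the word-based window here only keeps the sketch dependency-free.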

Notes

  1. https://www.olfeo.com/ (visited on: 07/07/2022).

  2. https://romeo.univ-reims.fr/.

Acknowledgement

This work is part of the RAPID project METIS which was funded by the French Ministry of the Armed Forces, Defence Innovation Agency (Reference number: 202906117).

Author information

Corresponding author

Correspondence to Mohamed Zohir Koufi.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Koufi, M.Z., Guessoum, Z., Keziou, A., Yahiaoui, I., Martineau, C., Domin, W. (2024). Text Chunking to Improve Website Classification. In: Pereira, A.I., Mendes, A., Fernandes, F.P., Pacheco, M.F., Coelho, J.P., Lima, J. (eds) Optimization, Learning Algorithms and Applications. OL2A 2023. Communications in Computer and Information Science, vol 1981. Springer, Cham. https://doi.org/10.1007/978-3-031-53025-8_15

  • DOI: https://doi.org/10.1007/978-3-031-53025-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53024-1

  • Online ISBN: 978-3-031-53025-8

  • eBook Packages: Computer Science (R0)
