Ensemble approach for web page classification

Multimedia Tools and Applications

Abstract

Over the decades, the World Wide Web has become an abundant, distributed repository of hyperlinked content spanning diverse information domains. Search engines perform remarkably well at locating information, but they remain inadequate for focused crawling of web content. Web page classification, being pivotal for information retrieval and management tasks, plays an imperative role in natural language processing for creating classified web document repositories and building indexed web directories. Conventional machine learning approaches extract the desired features from web pages in order to classify them, whereas deep learning algorithms learn the desired features as the network goes deeper. Pre-trained models based on transfer learning, such as BERT, attain impressive performance for text classification. In this study, we evaluate the effectiveness of adopting the pre-trained model BERT for the task of classifying web pages into different categories. We propose an ensemble approach for web page classification that learns contextual representations using pre-trained bidirectional BERT and then applies deep Inception modelling with residual connections to fine-tune on the target task by utilizing parallel multi-scale semantics. Experimental evaluation shows that the proposed ensemble model outperforms benchmark baselines and achieves better performance than other transfer learning approaches evaluated on the web page classification task across different classification datasets.
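The abstract gives no layer sizes or hyperparameters, so the following is only a minimal sketch of the general idea of an Inception-style block with a residual connection applied to contextual token embeddings. It is not the authors' implementation. Everything concrete here is an assumption made for illustration: random vectors stand in for the BERT output, the filter widths (1, 3, 5), the mean pooling, the dimensions, and all function names are hypothetical.

```python
import math
import random


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def conv1d_same(x, kernels, width):
    """ReLU(1D convolution) over the token axis, zero-padded to keep length T.

    x:       list of T token vectors, each of length D
    kernels: list of F filters, each a flat list of width * D weights
    """
    dim = len(x[0])
    pad = width // 2  # odd widths only, so output length equals input length
    zero = [0.0] * dim
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for t in range(len(x)):
        window = [v for vec in padded[t:t + width] for v in vec]
        out.append([max(0.0, sum(w * v for w, v in zip(k, window))) for k in kernels])
    return out


def linear(x, weights):
    """Apply one weight matrix (rows = output units) to every token vector."""
    return [[sum(w * v for w, v in zip(row, vec)) for row in weights] for vec in x]


def inception_residual_block(x, params):
    """Parallel multi-scale branches, concatenated, projected, added to the input."""
    branches = [conv1d_same(x, params["conv"][w], w) for w in params["widths"]]
    # Per token, place the outputs of all filter widths side by side.
    concat = [[v for b in branches for v in b[t]] for t in range(len(x))]
    projected = linear(concat, params["proj"])  # back to the input dimension
    # Residual connection: the block learns a correction on top of its input.
    return [[a + b for a, b in zip(xi, pi)] for xi, pi in zip(x, projected)]


def classify(x, params):
    """Mean-pool the block output over tokens, then apply a softmax layer."""
    h = inception_residual_block(x, params)
    pooled = [sum(col) / len(h) for col in zip(*h)]
    logits = [sum(w * v for w, v in zip(row, pooled)) for row in params["out"]]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(rng, dim, filters, widths, num_classes):
    return {
        "widths": widths,
        "conv": {w: rand_matrix(rng, filters, w * dim) for w in widths},
        "proj": rand_matrix(rng, dim, filters * len(widths)),
        "out": rand_matrix(rng, num_classes, dim),
    }


rng = random.Random(0)
DIM, TOKENS, CLASSES = 8, 6, 4
# Assumption: random vectors stand in for BERT's per-token contextual embeddings.
embeddings = rand_matrix(rng, TOKENS, DIM, scale=1.0)
params = init_params(rng, DIM, filters=4, widths=(1, 3, 5), num_classes=CLASSES)
probs = classify(embeddings, params)
print(len(probs), round(sum(probs), 6))
```

In a real pipeline the `embeddings` list would come from a pre-trained encoder and the weights would be trained jointly with it; the sketch only shows how branches with different receptive fields can be combined and added back to the input.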

Acknowledgements

The authors would like to thank Google Colaboratory for providing free TPU resources for our experiments on efficient web page classification.

Author information

Corresponding author

Correspondence to Amit Gupta.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gupta, A., Bhatia, R. Ensemble approach for web page classification. Multimed Tools Appl 80, 25219–25240 (2021). https://doi.org/10.1007/s11042-021-10891-3

  • DOI: https://doi.org/10.1007/s11042-021-10891-3
