URL Classification Using Convolutional Neural Network for a New Large Dataset

Hung, Phan Duy; Hung, Nguyen Dinh; Diep, Vu Thu

doi:10.1007/978-3-031-16538-2_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13492)

Included in the following conference series:

International Conference on Cooperative Design, Visualization and Engineering

Abstract

Real-time web page classification is increasingly needed because of the rapid growth in both the number of web pages and Internet usage. To address this need, URL-based methods have been proposed in the literature; they offer faster classification and lower computational cost than content-based approaches. This work proposes a CNN-based method that takes only URLs as input. We extract word-level tokens from the URLs, feed them into a word embedding layer, and then into hyperparameter-tuned CNN layers. Our experiments show that this method achieves an F1-score of 0.9759 on a new large dataset and outperforms many existing methods.
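The abstract does not spell out how URLs become input for the embedding layer, so the following is only a minimal sketch of one plausible preprocessing step, not the authors' implementation. It splits a URL into word-level tokens and maps them to fixed-length integer sequences of the kind an embedding layer typically consumes. The splitting rule (any non-alphanumeric character separates tokens), the reserved ids `PAD` and `UNK`, and the function names `tokenize_url`, `build_vocab`, and `encode` are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

PAD, UNK = 0, 1  # assumed reserved ids: padding and out-of-vocabulary


def tokenize_url(url):
    """Split a URL into lowercase word-level tokens (hypothetical scheme)."""
    # urlsplit only recognises the host when a scheme or '//' prefix is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    text = " ".join([parts.netloc, parts.path, parts.query]).lower()
    # Any run of non-alphanumeric characters ('/', '.', '-', '_', '=') is a separator.
    return [tok for tok in re.split(r"[\W_]+", text) if tok]


def build_vocab(urls, max_size=None):
    """Map the most frequent tokens to integer ids, starting after PAD and UNK."""
    counts = Counter(tok for url in urls for tok in tokenize_url(url))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def encode(url, vocab, max_len):
    """Turn a URL into a fixed-length list of token ids, truncated or padded."""
    ids = [vocab.get(tok, UNK) for tok in tokenize_url(url)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # Made-up example URLs, used only to show the shapes involved.
    corpus = ["example.com/sport/football", "example.com/news"]
    vocab = build_vocab(corpus)
    print(tokenize_url("https://news.example.com/the-thao/bong-da?id=42"))
    print(encode("example.com/music", vocab, max_len=5))
```

Sequences produced this way could then be passed to an embedding layer followed by convolutional layers. The scheme (`https`) is deliberately left out of the token list here, which is a design choice of this sketch rather than something the abstract states.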

Author information

Authors and Affiliations

FPT University, Hanoi, Vietnam
Phan Duy Hung & Nguyen Dinh Hung
Hanoi University of Science and Technology, Hanoi, Vietnam
Vu Thu Diep

Corresponding author

Correspondence to Vu Thu Diep.

Editor information

Editors and Affiliations

University of Balearic Islands, Palma de Mallorca, Spain
Yuhua Luo

About this paper

Cite this paper

Hung, P.D., Hung, N.D., Diep, V.T. (2022). URL Classification Using Convolutional Neural Network for a New Large Dataset. In: Luo, Y. (eds) Cooperative Design, Visualization, and Engineering. CDVE 2022. Lecture Notes in Computer Science, vol 13492. Springer, Cham. https://doi.org/10.1007/978-3-031-16538-2_11

DOI: https://doi.org/10.1007/978-3-031-16538-2_11
Published: 20 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16537-5
Online ISBN: 978-3-031-16538-2
eBook Packages: Computer Science, Computer Science (R0)
