Improving the Categorization of Web Sites by Analysis of Html-Tags Statistics to Block Inappropriate Content

Novozhilov, Dmitry; Kotenko, Igor; Chechulin, Andrey

doi:10.1007/978-3-319-25017-5_24

Part of the book series: Studies in Computational Intelligence (SCI, volume 616)

Abstract

The paper addresses the problem of improving the quality of web site categorization using data mining methods. This problem is important for automated parental control systems, whose purpose is to protect users from unwanted or inappropriate information. The novelty of the proposed approach lies in using HTML tag statistics of web pages to improve the categorization of sites that are similar in textual content but differ in their structural features. The paper describes the architecture of the categorization system, its operating algorithm, the experimental results, and an assessment of classification quality.
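As a rough illustration of what "HTML tag statistics" could mean in practice, the following minimal sketch computes each tag's share of all opening tags on a page. The abstract does not specify the features actually used, so the choice of relative frequencies, the names `TagCounter` and `tag_statistics`, and the sample page are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def tag_statistics(html):
    """Return each tag's share of all opening tags on the page.

    Assumption: relative tag frequency is used as the structural feature;
    the paper may use a different statistic.
    """
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in parser.counts.items()}


# Made-up sample page: two images and one paragraph.
page = "<html><body><img src='a.png'><img src='b.png'><p>Hi</p></body></html>"
stats = tag_statistics(page)
```

A feature vector of this kind could be passed to a classifier alongside textual features, so that two sites with similar wording but different layouts (for example, an image-heavy gallery versus a text article) might be separated by their tag profiles.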

Acknowledgment

This research is supported by the Ministry of Education and Science of the Russian Federation (contract # 14.604.21.0147, unique contract identifier RFMEFI60414X0147).

Author information

Authors and Affiliations

Laboratory of Computer Security Problems, St. Petersburg Institute for Informatics and Automation (SPIIRAS), 39, 14 Linija, St. Petersburg, Russia
Dmitry Novozhilov, Igor Kotenko & Andrey Chechulin

Corresponding author

Correspondence to Igor Kotenko.

Editor information

Editors and Affiliations

Departamento de Informática/Centro ALGORITMI, Escola de Engenharia, Universidade do Minho, Braga, Portugal
Paulo Novais
Computer Science Department, Universidad Autónoma De Madrid, Madrid, Spain
David Camacho
Departamento de Informática/Centro ALGORITMI, Escola de Engenharia, Universidade do Minho, Braga, Portugal
Cesar Analide
LIP6, University Pierre and Marie Curie 4, Paris Cedex 05, France
Amal El Fallah Seghrouchni
Software Engineering Department, Faculty of Automatics, Computers and Electronics, University of Craiova, Craiova, Romania
Costin Badica

About this paper

Cite this paper

Novozhilov, D., Kotenko, I., Chechulin, A. (2016). Improving the Categorization of Web Sites by Analysis of Html-Tags Statistics to Block Inappropriate Content. In: Novais, P., Camacho, D., Analide, C., El Fallah Seghrouchni, A., Badica, C. (eds) Intelligent Distributed Computing IX. Studies in Computational Intelligence, vol 616. Springer, Cham. https://doi.org/10.1007/978-3-319-25017-5_24

DOI: https://doi.org/10.1007/978-3-319-25017-5_24
Published: 18 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25015-1
Online ISBN: 978-3-319-25017-5
eBook Packages: Engineering, Engineering (R0)