Abstract
The paper considers the problem of improving the quality of web sites categorization using data mining methods. This goal is important for automated systems of parental control. The purpose of such systems is protection from unwanted or inappropriate information. The novelty of the proposed approach is in usage of HTML tags statistics of web pages to improve the categorization of sites that are similar in terms of textual content, but differing in their structural features. The paper describes the architecture of the categorization system, the algorithm of its work, the results of experiments, and assessment of classification quality.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: ECML-98, LNCS, vol. 1398, pp. 137–142. Springer (1998)
Ko, Y., Seo, J.: Automatic text categorization by unsupervised learning. In: Coling’00, pp. 453–459. Morgan Kaufmann (2000)
Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through content analysis. In: ACM, pp. 83–92 (2006)
Kehagias, A., Petridis, V., Kaburlasos, V.G., Fragkou, P.: A comparison of word-and sense-based text categorization using several classification algorithms. J. Intell. Inf. Syst. 21(3), 227–247 (2000)
Attardi, G., Gulli, A., Sebastiani, F.: Automatic web page categorization by link and context analysis. In: THAI’99, pp. 105–119 (1999)
Khonji, M., Iraqi, Y., Jones, A.: Enhancing phishing E-Mail classifiers: a lexical URL analysis approach. Int. J. Inf. Secur. Res. 6, 236–245 (2012)
Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: KDD’09, pp. 1245–1254. ACM (2009)
Kan, M.-Y., Thi, H.O.N.: Fast webpage classification using URL features. In: ICIKM 2005, ACM (2005)
Geide, M.: N-gram Character Sequence Analysis of Benign vs. Malicious Domains/URLs. Available at http://analysis-manifold.com/ Accessed 24 March 2015
Meshkizadeh, S., Masoud-Rahmani, A.: Webpage classification based on compound of using html features and url features and features of sibling pages. Int. J. Adv. Comput. Technol. 2(4), 36–46 (2010)
Patil, A.S., Pawar, B.V.: Automated classification of web sites using naive bayesian algorithm. In: IMECS2012, vol. 1, p. 466 (2012)
Riboni, D. Feature selection for web page classification. In: EURASIA-ICT-2002 (2002)
Kotenko, I., Chechulin, A., Shorov, A., Komashinsky, D.: Analysis and evaluation of web pages classification techniques for inappropriate content blocking. LNAI 8557, 39–54 (2014)
URLBlacklist.com.: http://urlblacklist.com/ Accessed 24 March 2015
Shalla Secure Services KG.: http://www.shallalist.de/ Accessed 24 March 2015
Acknowledgment
This research is being supported by The Ministry of Education and Science of The Russian Federation (contract # 14.604.21.0147, unique contract identifier RFMEFI60414X0147).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Novozhilov, D., Kotenko, I., Chechulin, A. (2016). Improving the Categorization of Web Sites by Analysis of Html-Tags Statistics to Block Inappropriate Content. In: Novais, P., Camacho, D., Analide, C., El Fallah Seghrouchni, A., Badica, C. (eds) Intelligent Distributed Computing IX. Studies in Computational Intelligence, vol 616. Springer, Cham. https://doi.org/10.1007/978-3-319-25017-5_24
Download citation
DOI: https://doi.org/10.1007/978-3-319-25017-5_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25015-1
Online ISBN: 978-3-319-25017-5
eBook Packages: EngineeringEngineering (R0)