Improving the Categorization of Web Sites by Analysis of Html-Tags Statistics to Block Inappropriate Content

  • Conference paper
Intelligent Distributed Computing IX

Part of the book series: Studies in Computational Intelligence (SCI, volume 616)

Abstract

The paper considers the problem of improving the quality of web site categorization using data mining methods. This problem is important for automated parental control systems, whose purpose is to protect users from unwanted or inappropriate information. The novelty of the proposed approach lies in using HTML tag statistics of web pages to improve the categorization of sites that are similar in textual content but differ in their structural features. The paper describes the architecture of the categorization system, its operating algorithm, the experimental results, and an assessment of classification quality.
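As a rough illustration of what "HTML tag statistics" could mean as a structural feature, the sketch below counts the opening tags in a page and normalizes the counts to relative frequencies. It is only a minimal sketch under assumptions: the names `TagCounter` and `tag_statistics` are hypothetical, and the choice of relative frequencies is an assumption rather than the feature set or algorithm used in the paper.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase to this hook.
        self.counts[tag] += 1


def tag_statistics(html: str) -> Dict[str, float]:
    """Return the relative frequency of each tag (assumed feature form)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in parser.counts.items()}


if __name__ == "__main__":
    page = (
        "<html><body><p>a</p><p>b</p>"
        "<img src='x.png'><a href='#'>link</a></body></html>"
    )
    for tag, share in sorted(tag_statistics(page).items()):
        print(f"{tag}: {share:.2f}")
```

Two pages with similar text but different layouts (for example, an image gallery versus a news article) would yield different frequency vectors, which a classifier could use alongside text features.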



Acknowledgment

This research is supported by the Ministry of Education and Science of the Russian Federation (contract no. 14.604.21.0147, unique contract identifier RFMEFI60414X0147).

Author information


Corresponding author

Correspondence to Igor Kotenko.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Novozhilov, D., Kotenko, I., Chechulin, A. (2016). Improving the Categorization of Web Sites by Analysis of Html-Tags Statistics to Block Inappropriate Content. In: Novais, P., Camacho, D., Analide, C., El Fallah Seghrouchni, A., Badica, C. (eds) Intelligent Distributed Computing IX. Studies in Computational Intelligence, vol 616. Springer, Cham. https://doi.org/10.1007/978-3-319-25017-5_24


  • DOI: https://doi.org/10.1007/978-3-319-25017-5_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25015-1

  • Online ISBN: 978-3-319-25017-5

  • eBook Packages: Engineering, Engineering (R0)
