Abstract
This paper addresses two main aspects of developing a high-quality system for dynamic content filtering: the collection of meaningful training data and the choice of feature selection techniques. Because the Web changes rapidly, the classifier must be retrained regularly. Training data collection is treated as a special case of focused crawling, and a simple, easy-to-tune technique is proposed, implemented, and tested. The proposed feature selection technique aims to minimize the feature set size without loss of accuracy and takes the interlinked nature of the Web into account. This is essential for making a content filtering solution fast and unobtrusive for end users, especially when filtering runs on resource-constrained hardware. An evaluation and comparison of various classifiers and techniques is provided.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Suvorov, R., Sochenkov, I., Tikhomirov, I. (2014). Training Datasets Collection and Evaluation of Feature Selection Methods for Web Content Filtering. In: Agre, G., Hitzler, P., Krisnadhi, A.A., Kuznetsov, S.O. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2014. Lecture Notes in Computer Science, vol 8722. Springer, Cham. https://doi.org/10.1007/978-3-319-10554-3_12
DOI: https://doi.org/10.1007/978-3-319-10554-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10553-6
Online ISBN: 978-3-319-10554-3
eBook Packages: Computer Science, Computer Science (R0)