Abstract
This paper presents WebClass, a Web mining tool for content-based classification of Web pages that can be used for resource discovery. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. Web pages are represented as bags of words, with the words selected from training documents by means of a novel scoring measure. Three classification models are compared empirically on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.
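To make the ideas in the abstract concrete, the following is a minimal sketch of a bag-of-words representation that takes HTML tags into account, combined with a centroid classifier. It is not the WebClass implementation: the abstract does not specify the scoring measure or how layout structure is encoded, so the tag weights in `TAG_WEIGHTS`, the cosine similarity, and all function names are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: words inside these tags count more than body text.
TAG_WEIGHTS = {"title": 3, "h1": 2}


class WeightedBagParser(HTMLParser):
    """Build a bag of words in which counts depend on the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags that carry a weight
        self.bag = Counter()  # word -> weighted count

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        weight = TAG_WEIGHTS[self.stack[-1]] if self.stack else 1
        for word in re.findall(r"[a-z]+", data.lower()):
            self.bag[word] += weight


def bag_of_words(html):
    """Return the weighted bag of words for one HTML page."""
    parser = WeightedBagParser()
    parser.feed(html)
    parser.close()
    return parser.bag


def centroid(bags):
    """Average a list of bags into a single prototype vector."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return {word: count / len(bags) for word, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(html, centroids):
    """Assign the page to the class whose centroid is most similar."""
    bag = bag_of_words(html)
    return max(centroids, key=lambda label: cosine(bag, centroids[label]))


# Toy training pages, invented for this example.
training = {
    "astronomy": ["<title>Star catalog</title><h1>Galaxy</h1><p>telescope star</p>"],
    "jazz": ["<title>Jazz records</title><p>saxophone trumpet jazz</p>"],
}
centroids = {
    label: centroid([bag_of_words(page) for page in pages])
    for label, pages in training.items()
}
predicted = classify("<title>Telescope guide</title><p>star</p>", centroids)
```

A k-nearest-neighbor variant would compare the new page against each training bag directly instead of against one centroid per class. The feature selection step described in the abstract would come before either classifier and is left out here because its definition is not given.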
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Esposito, F., Malerba, D., Di Pace, L., Leo, P. (2000). A Machine Learning Approach to Web Mining. In: Lamma, E., Mello, P. (eds) AI*IA 99: Advances in Artificial Intelligence. AI*IA 1999. Lecture Notes in Computer Science, vol 1792. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46238-4_17
DOI: https://doi.org/10.1007/3-540-46238-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67350-7
Online ISBN: 978-3-540-46238-5
eBook Packages: Springer Book Archive