A Machine Learning Approach to Web Mining

Esposito, Floriana; Malerba, Donato; Di Pace, Luigi; Leo, Pietro

doi:10.1007/3-540-46238-4_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1792)

Included in the following conference series:

Congress of the Italian Association for Artificial Intelligence

Abstract

This paper presents WebClass, a Web mining tool for content-based classification of Web pages that can be used for resource discovery. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. Web pages are represented as bags of words, where the words are selected from training documents by means of a novel scoring measure. Three classification models are empirically compared on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.
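
To make the bag-of-words and centroid ideas in the abstract concrete, here is a minimal sketch in Python. It is not the WebClass implementation: the vocabulary is fixed by hand instead of being chosen by the paper's scoring measure (which the abstract does not define), HTML tags are simply stripped rather than used as layout information, and cosine similarity is an assumed choice of similarity function. All function names and the toy pages are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(html, vocabulary):
    """Count how often each vocabulary word occurs in a page.

    Assumption: tags are discarded; the layout structure the paper
    exploits is not modelled in this sketch.
    """
    text = re.sub(r"<[^>]+>", " ", html)          # drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer
    counts = Counter(t for t in tokens if t in vocabulary)
    return [counts[word] for word in vocabulary]


def centroid(vectors):
    """Component-wise mean of the training vectors of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hand-picked vocabulary and toy training pages (illustrative only).
vocabulary = ["learning", "tree", "football", "goal"]
training = {
    "ai": [
        "<h1>Machine learning</h1><p>decision tree learning</p>",
        "<p>tree induction and learning</p>",
    ],
    "sport": [
        "<title>Football</title><p>a late goal</p>",
        "<p>football goal goal</p>",
    ],
}

centroids = {
    label: centroid([bag_of_words(page, vocabulary) for page in pages])
    for label, pages in training.items()
}

predicted = classify(
    bag_of_words("<p>learning a decision tree</p>", vocabulary), centroids
)
print(predicted)
```

A k-nearest-neighbor variant would keep the individual training vectors instead of averaging them and would vote among the k most similar ones; the same `bag_of_words` and `cosine` helpers could be reused for that.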

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari, via Orabona, 4, 70126, Bari
Floriana Esposito & Donato Malerba
Java Technology Center, IBM SEMEA Sud, via Tridente, 42/14, 70125, Bari
Luigi Di Pace & Pietro Leo

Editor information

Editors and Affiliations

DEIS, University of Bologna, Viale Risorgimento, 2, 40136, Bologna, Italy
Evelina Lamma & Paola Mello

About this paper

Cite this paper

Esposito, F., Malerba, D., Di Pace, L., Leo, P. (2000). A Machine Learning Approach to Web Mining. In: Lamma, E., Mello, P. (eds) AI*IA 99: Advances in Artificial Intelligence. AI*IA 1999. Lecture Notes in Computer Science, vol 1792. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46238-4_17

DOI: https://doi.org/10.1007/3-540-46238-4_17
Published: 15 December 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67350-7
Online ISBN: 978-3-540-46238-5
eBook Packages: Springer Book Archive
