A Machine Learning Approach to Web Mining

  • Conference paper
AI*IA 99: Advances in Artificial Intelligence (AI*IA 1999)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1792)

Abstract

This paper presents WebClass, a Web mining tool for content-based classification of Web pages that can be used for resource discovery. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. Web pages are represented as bags of words, where the words are selected from training documents by means of a novel scoring measure. Three classification models are empirically compared on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.
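As a rough illustration of two of the classifier families named in the abstract, the following minimal sketch builds bag-of-words vectors and classifies them with a centroid model and with k-nearest-neighbor. It is not the WebClass implementation. The vocabulary, the toy training pages, the tokenizer, and the use of cosine similarity are all assumptions made for this example; the paper's scoring measure for word selection and its use of HTML layout structure are not reproduced here, and decision trees are left out for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroids(vectors, labels):
    """Average the training vectors of each class."""
    sums, sizes = {}, Counter(labels)
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return {lab: [x / sizes[lab] for x in acc] for lab, acc in sums.items()}

def classify_centroid(vec, cents):
    """Pick the class whose centroid is most similar to the vector."""
    return max(cents, key=lambda lab: cosine(vec, cents[lab]))

def classify_knn(vec, vectors, labels, k=3):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(vec, pair[0]), reverse=True)
    votes = Counter(lab for _, lab in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical vocabulary and training pages, chosen only for illustration.
VOCAB = ["learning", "classifier", "recipe", "oven"]
TRAIN = [
    ("machine learning classifier learning", "ml"),
    ("a classifier for learning tasks", "ml"),
    ("oven recipe for bread", "cooking"),
    ("recipe with oven and pan", "cooking"),
]
train_vecs = [bag_of_words(text, VOCAB) for text, _ in TRAIN]
train_labs = [lab for _, lab in TRAIN]
cents = centroids(train_vecs, train_labs)

if __name__ == "__main__":
    query = bag_of_words("Learning a new classifier", VOCAB)
    print(classify_centroid(query, cents))
    print(classify_knn(query, train_vecs, train_labs, k=3))
```

The centroid model stores one averaged vector per class, while k-nearest-neighbor keeps every training vector and defers all work to classification time, which is the usual trade-off between the two.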


Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Esposito, F., Malerba, D., Di Pace, L., Leo, P. (2000). A Machine Learning Approach to Web Mining. In: Lamma, E., Mello, P. (eds) AI*IA 99: Advances in Artificial Intelligence. AI*IA 1999. Lecture Notes in Computer Science, vol 1792. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46238-4_17

  • DOI: https://doi.org/10.1007/3-540-46238-4_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67350-7

  • Online ISBN: 978-3-540-46238-5

  • eBook Packages: Springer Book Archive
