A Simple, Structure-Sensitive Approach for Web Document Classification

  • Conference paper
Advances in Web Intelligence (AWIC 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3528)

Abstract

In this paper we describe a new approach to the classification of web documents. Most web classification methods are based on the vector space document representation used in information retrieval. Recently, a graph-based web document representation model was shown to outperform the traditional vector representation when used with the k-Nearest Neighbor (k-NN) classification algorithm. Here we propose a new hybrid approach to web document classification built upon both graph and vector representations. We used the k-NN algorithm and three benchmark document collections to compare this method with the graph-based and vector-based methods separately. The results show that in most cases the hybrid approach outperforms both the graph and vector approaches in classification accuracy, while also significantly reducing classification time.
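To make the vector-space baseline mentioned in the abstract concrete, the following is a minimal sketch of k-NN classification over term-frequency vectors with cosine similarity. It is not the hybrid method of the paper, whose details are not given here. The function names (`term_vector`, `cosine_similarity`, `knn_classify`), the whitespace tokenization, and the toy training set are all assumptions made for illustration.

```python
from collections import Counter
import math


def term_vector(text):
    """Build a term-frequency vector (hypothetical helper): token -> count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Label a query document by majority vote among its k most similar
    training documents. `training` is a list of (text, label) pairs."""
    q = term_vector(query)
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(q, term_vector(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented purely for illustration.
training = [
    ("football match goal team", "sports"),
    ("team wins league match", "sports"),
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
]

predicted = knn_classify("team scores goal in match", training, k=3)
print(predicted)
```

A graph-based variant would keep the same voting step but replace `cosine_similarity` with a distance defined on document graphs; that substitution is only a conceptual note and is not implemented in this sketch.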

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Markov, A., Last, M. (2005). A Simple, Structure-Sensitive Approach for Web Document Classification. In: Szczepaniak, P.S., Kacprzyk, J., Niewiadomski, A. (eds) Advances in Web Intelligence. AWIC 2005. Lecture Notes in Computer Science, vol 3528. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11495772_46

  • DOI: https://doi.org/10.1007/11495772_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26219-0

  • Online ISBN: 978-3-540-31900-9

  • eBook Packages: Computer Science, Computer Science (R0)
