A Simple, Structure-Sensitive Approach for Web Document Classification

Markov, Alex; Last, Mark

doi:10.1007/11495772_46


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3528)

Included in the following conference series:

International Atlantic Web Intelligence Conference

Abstract

In this paper we describe a new approach to the classification of web documents. Most web classification methods are based on the vector space document representation used in information retrieval. Recently, a graph-based web document representation model was shown to outperform the traditional vector representation when used with the k-Nearest Neighbor (k-NN) classification algorithm. Here we suggest a new hybrid approach to web document classification built upon both graph and vector representations. The k-NN algorithm and three benchmark document collections were used to compare this method with the graph-based and vector-based methods separately. The results demonstrate that in most cases the hybrid approach outperforms the graph and vector approaches in classification accuracy while significantly reducing classification time.
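
As a rough illustration of the vector-space k-NN baseline that the abstract refers to, the following minimal sketch classifies a short text by majority vote among its nearest training documents. It is not the hybrid method of the paper, whose details are not given in the abstract. The function names (`to_vector`, `cosine_similarity`, `knn_classify`), the toy `TRAINING` data, and the choice of raw term frequencies with cosine similarity are all assumptions made for this example.

```python
from collections import Counter
import math


def to_vector(text):
    """Build a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Return the majority label among the k most similar training documents."""
    q = to_vector(query)
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(q, to_vector(item[0])),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy collection, for illustration only.
TRAINING = [
    ("football match score goal", "sports"),
    ("tennis player wins match", "sports"),
    ("stock market shares fall", "business"),
    ("bank raises interest rates", "business"),
]

if __name__ == "__main__":
    print(knn_classify("goal scored in football match", TRAINING, k=3))
```

A graph-based variant would keep the same k-NN voting step and replace only the document representation and the similarity measure.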

Author information

Authors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of Negev, Beer-Sheva, 84105, Israel
Alex Markov & Mark Last

Editor information

Editors and Affiliations

Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warsaw, Poland
Janusz Kacprzyk
Institute of Computer Science, Technical University of Łódź, Poland
Adam Niewiadomski

About this paper

Cite this paper

Markov, A., Last, M. (2005). A Simple, Structure-Sensitive Approach for Web Document Classification. In: Szczepaniak, P.S., Kacprzyk, J., Niewiadomski, A. (eds) Advances in Web Intelligence. AWIC 2005. Lecture Notes in Computer Science, vol 3528. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11495772_46

DOI: https://doi.org/10.1007/11495772_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26219-0
Online ISBN: 978-3-540-31900-9
eBook Packages: Computer Science, Computer Science (R0)
