Tree-Based Method for Classifying Websites Using Extended Hidden Markov Models

Yazdani, Majid; Eftekhar, Milad; Abolhassani, Hassan

doi:10.1007/978-3-642-01307-2_80

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Website classification is an important problem that has recently emerged in the field of web mining. Its complexity, together with the need for accurate and fast algorithms, has prompted many attempts at a solution, but the problem is still far from being solved efficiently. We propose a new approach that uses the content of web pages together with the link structure between them to improve the accuracy of the results. We model each predefined webpage class with a naïve Bayes model and each website class with an extended version of the Hidden Markov Model. A few sample websites serve as seeds for estimating the models' parameters. To classify a website, we represent it as a tree structure and modify the Viterbi algorithm to evaluate the probability that each website model generates that tree. Because a website can contain a large number of pages, we use a sampling technique that not only reduces the running time of the algorithm but also improves the accuracy of the classification. Finally, we present experimental results that compare the performance of our algorithm with that of previous methods.
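
The abstract does not spell out the extended Hidden Markov Model or the modified Viterbi algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each website class has its own model with a start distribution over page classes, a parent-to-child transition table, and naïve Bayes word emissions per page class, and a site is scored by a Viterbi-style max-product pass from the leaves of its page tree up to the root. The names `Page`, `SiteModel`, `tree_log_score`, and `classify`, the two toy models, and the fixed penalty for unseen words are all hypothetical and are not taken from the paper. Sampling of pages is not shown.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """A node of the website tree: its words and the pages it links down to."""
    words: List[str]
    children: List["Page"] = field(default_factory=list)


@dataclass
class SiteModel:
    """Hypothetical website-class model; all tables hold log-probabilities."""
    log_start: Dict[str, float]             # log P(page class of the root)
    log_trans: Dict[str, Dict[str, float]]  # log P(child class | parent class)
    log_emit: Dict[str, Dict[str, float]]   # naive Bayes: log P(word | page class)
    log_unseen: float = math.log(1e-6)      # assumed penalty for unseen words

    def emit(self, state: str, words: List[str]) -> float:
        # Naive Bayes assumption: words are independent given the page class.
        table = self.log_emit[state]
        return sum(table.get(w, self.log_unseen) for w in words)


def subtree_scores(page: Page, model: SiteModel) -> Dict[str, float]:
    """For each page class s, the best log-score of the subtree if page is in s."""
    child_scores = [subtree_scores(c, model) for c in page.children]
    scores = {}
    for s in model.log_start:
        total = model.emit(s, page.words)
        for cs in child_scores:
            # Viterbi-style step: pick the best class for each child independently.
            total += max(model.log_trans[s][t] + cs[t] for t in cs)
        scores[s] = total
    return scores


def tree_log_score(root: Page, model: SiteModel) -> float:
    """Best joint log-score of the whole tree under one website model."""
    scores = subtree_scores(root, model)
    return max(model.log_start[s] + scores[s] for s in scores)


def classify(root: Page, models: Dict[str, SiteModel]) -> str:
    """Return the name of the website model that scores the tree highest."""
    return max(models, key=lambda name: tree_log_score(root, models[name]))


def logs(probs: Dict[str, float]) -> Dict[str, float]:
    return {k: math.log(v) for k, v in probs.items()}


# Toy parameters, invented for illustration only.
trans = {"home": logs({"home": 0.1, "item": 0.9}),
         "item": logs({"home": 0.2, "item": 0.8})}
shop = SiteModel(
    log_start=logs({"home": 0.9, "item": 0.1}),
    log_trans=trans,
    log_emit={"home": logs({"welcome": 0.5, "shop": 0.5}),
              "item": logs({"price": 0.6, "buy": 0.4})},
)
campus = SiteModel(
    log_start=logs({"home": 0.9, "item": 0.1}),
    log_trans=trans,
    log_emit={"home": logs({"welcome": 0.5, "campus": 0.5}),
              "item": logs({"course": 0.6, "exam": 0.4})},
)

site = Page(["welcome", "shop"], [Page(["price", "buy"]), Page(["price"])])
label = classify(site, {"shop": shop, "campus": campus})
print(label)
```

In this sketch the tree replaces the linear chain of an ordinary HMM: each child's best class depends only on its parent's class, so the upward pass visits every page once and costs time proportional to the number of pages times the square of the number of page classes. Working in log-space avoids numerical underflow on sites with many pages.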

Author information

Authors and Affiliations

Web Intelligence Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Majid Yazdani, Milad Eftekhar & Hassan Abolhassani

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

About this paper

Cite this paper

Yazdani, M., Eftekhar, M., Abolhassani, H. (2009). Tree-Based Method for Classifying Websites Using Extended Hidden Markov Models. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_80

DOI: https://doi.org/10.1007/978-3-642-01307-2_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)
