Abstract
One challenge for web search engines is to assess the quality of web pages independently of any given user request. High-quality web pages give readers suitable entry points to more focused information on the web. This paper addresses topic-independent selection of high-quality web pages, with the aim of reducing redundancy and removing noise in web information. Several non-content features and their effects on high-quality page selection are studied. K-means clustering over these features is then used to separate high-quality pages from ordinary ones. Experiments were conducted on the 19 GB (document size) TREC web data set (.GOV data). With the proposed approach, fewer than 50% of the web pages are selected as high-quality ones, yet they cover about 90% of the key information in the whole set. Information retrieval on this high-quality page set achieves an improvement of more than 40% compared with retrieval on the whole data collection.
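The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the non-content features, so the two used here (in-link count and URL depth), the toy page data, and the rule that treats the cluster with more in-links as the high-quality one are all assumptions made for illustration.

```python
import math
import random


def normalize(rows):
    """Min-max scale each feature column to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - lo) / sp for v, lo, sp in zip(r, lows, spans)) for r in rows]


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means over a list of equal-length feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct input points as starting centroids
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical pages: (in-link count, URL depth). Both features are assumptions.
pages = {
    "a": (120, 1),
    "b": (95, 1),
    "c": (80, 2),
    "d": (3, 5),
    "e": (1, 6),
    "f": (0, 4),
}

names = list(pages)
features = normalize([pages[n] for n in names])
labels, centroids = kmeans(features, k=2)

# Assumed heuristic: the cluster whose centroid has more in-links is "high-quality".
best = max(range(2), key=lambda c: centroids[c][0])
high_quality = sorted(n for n, lab in zip(names, labels) if lab == best)
print(high_quality)
```

Features are scaled before clustering so that a wide-ranging feature such as in-link count does not dominate the Euclidean distance. In a real setting, the choice of which cluster counts as high-quality would need to be grounded in labelled examples rather than a single-feature rule.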
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, C., Liu, Y., Zhang, M., Ma, S. (2005). Topic-Independent Web High-Quality Page Selection Based on K-Means Clustering. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_43
DOI: https://doi.org/10.1007/11562382_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science (R0)