Abstract
Combating Web spam has become one of the top challenges for Web search engines. Most previous research on link-based Web spam identification focuses on exploiting hyperlink graphs and the corresponding user-behavior models. However, because Web spammers can easily add and remove hyperlinks, the hyperlink graph is unreliable. We construct a user browsing graph from users’ Web access logs and apply link analysis algorithms to this graph to identify Web spam pages. The constructed graph is much smaller than the original Web graph, so link analysis algorithms run efficiently on it. Comparative experimental results also show that algorithms run on the constructed graph outperform those run on the original graph.
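The idea in the abstract can be illustrated with a minimal sketch. It assumes the access log has already been reduced to (source URL, destination URL) pairs, one per observed user transition; that record format, the function names, and the choice of a weighted PageRank as the link analysis algorithm are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def build_browsing_graph(log_records):
    """Aggregate (source_url, destination_url) transitions into weighted edges.

    The record format is hypothetical; a real access log would need parsing
    and cleaning before reaching this step.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in log_records:
        if not src or not dst or src == dst:
            continue  # skip incomplete records and self-transitions
        graph[src][dst] += 1  # edge weight = number of observed transitions
        _ = graph[dst]        # make sure the destination exists as a node
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (a stand-in link analysis algorithm)."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        dangling = 0.0
        for u in nodes:
            total = sum(graph[u].values())
            if total == 0:
                dangling += rank[u]  # node with no outgoing transitions
                continue
            for v, weight in graph[u].items():
                new_rank[v] += damping * rank[u] * weight / total
        # Spread the rank held by dangling nodes uniformly over all nodes.
        share = damping * dangling / n
        for u in nodes:
            new_rank[u] += share
        rank = new_rank
    return rank
```

Because edges exist only where users actually moved between pages, a link that spammers insert but nobody follows contributes nothing to the scores in this sketch.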
Supported by the Chinese National Key Foundation Research & Development Plan (2004CB318108), Natural Science Foundation (60621062, 60503064, 60736044) and National 863 High Technology Project (2006AA01Z141).
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yu, H., Liu, Y., Zhang, M., Ru, L., Ma, S. (2009). Web Spam Identification with User Browsing Graph. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_4
DOI: https://doi.org/10.1007/978-3-642-04769-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04768-8
Online ISBN: 978-3-642-04769-5
eBook Packages: Computer Science (R0)