Abstract
The HITS algorithm proposed by Kleinberg is one of the representative methods for scoring Web pages by using hyperlinks. When the algorithm was proposed, most of the pages to which it gave high scores were indeed related to the given topic, and hence the algorithm could be used to find related pages. However, the algorithm and its variants, including BHITS proposed by Bharat and Henzinger, can no longer be used to find related pages on today’s Web because of the increase in spam links. In this paper, we first propose three methods for finding “linkfarms,” that is, sets of spam links forming a densely connected subgraph of a Web graph. We then present an algorithm, called the trust-score algorithm, that gives high scores to pages that are, with high probability, not spam pages. Combining the three methods and the trust-score algorithm with BHITS, we obtain several variants of the HITS algorithm. We ascertain by experiments that one of them, named TaN+BHITS, which uses the trust-score algorithm and the method of finding linkfarms by employing name servers, is the most suitable for finding related pages on today’s Web. Our algorithms require no more time and memory than the original HITS algorithm and can be executed on a PC with a small amount of main memory.
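For readers unfamiliar with the baseline, the following is a minimal sketch of the standard HITS iteration (mutually reinforcing hub and authority scores), not of the TaN+BHITS variant or the trust-score algorithm proposed in the paper. The function names `hits` and `normalize`, the edge-list input format, and the fixed iteration count are assumptions made for illustration only.

```python
import math


def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) score dicts for a directed edge list.

    `edges` is a list of (source, destination) pairs; this input format
    is an assumption of the sketch, not something taken from the paper.
    """
    nodes = {node for edge in edges for node in edge}
    auth = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to a page.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the updated authority scores of linked-to pages.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Tiny hypothetical graph: pages "a" and "b" both link to "c"; "a" also links to "d".
example_edges = [("a", "c"), ("b", "c"), ("a", "d")]
authority, hub_score = hits(example_edges)
```

In this toy graph, "c" ends up with a higher authority score than "d" because it has more in-links from hubs. The same reinforcement is what a densely connected set of spam links exploits: pages that link heavily to one another raise each other's scores regardless of their relevance to the topic.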
References
Bharat, K., Henzinger, M.R.: Improved algorithms for topic distillation in a hyperlinked environment. In: Proc. 21st ACM SIGIR Conference, pp. 104–111 (1998)
Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P.: Finding authorities and hubs from link structures on the World Wide Web. In: Proc. 10th WWW Conference, pp. 415–429 (2001)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. In: Proc. 7th WWW Conference, pp. 14–18 (1998)
Costa Carvalho, A., Chirita, P., Moura, E., Calado, P., Nejdl, W.: Site level noise removal for search engines. In: Proc. 15th WWW Conference, pp. 73–82 (2006)
Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: Using statistical analysis to locate spam Web pages. In: Proc. 7th International Workshop on the Web and Databases, pp. 1–6 (2004)
Fetterly, D., Manasse, M., Najork, M., Ntoulas, A.: Detecting spam Web pages through content analysis. In: Proc. 15th WWW Conference, pp. 83–92 (2006)
Gyongyi, Z., Garcia-Molina, H., Pedersen, J.: Combating Web spam with TrustRank. In: Proc. 30th VLDB Conference, pp. 576–587 (2004)
Jeh, G., Widom, J.: SimRank: A measure of structural-context similarity. In: Proc. 8th ACM SIGKDD Conference, pp. 538–543 (2002)
Kleinberg, J.: Authoritative sources in a hyperlinked environment. In: Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, pp. 668–677 (1998)
Lempel, R., Moran, S.: The stochastic approach for link-structure analysis (SALSA) and the TKC effect. In: Proc. 9th WWW Conference, pp. 387–401 (2000)
Li, L., Shang, Y., Zhang, W.: Improvement of HITS-based algorithms on Web documents. In: Proc. 11th WWW Conference, pp. 527–535 (2002)
Wang, M.: A significant improvement to Clever algorithm in hyperlinked environment. In: Poster Proc. 11th WWW Conference (2002)
Wang, X., Lu, Z., Zhou, A.: Topic Exploration and Distillation for Web Search by a Similarity-Based Analysis. In: Meng, X., Su, J., Wang, Y. (eds.) WAIM 2002. LNCS, vol. 2419, pp. 316–327. Springer, Heidelberg (2002)
Wu, B., Davison, B.D.: Identifying link farm spam pages. In: Proc. 14th WWW Conference, pp. 820–829 (2005)
Zhang, Y., Yu, J.X., Hou, J.: Web Communities: Analysis and Construction. Springer, Berlin (2006)
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Asano, Y., Tezuka, Y., Nishizeki, T. (2007). Improvements of HITS Algorithms for Spam Links. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_50
DOI: https://doi.org/10.1007/978-3-540-72524-4_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72483-4
Online ISBN: 978-3-540-72524-4
eBook Packages: Computer Science (R0)