Improvements of HITS Algorithms for Spam Links

  • Conference paper
  • In: Advances in Data and Web Management (APWeb 2007, WAIM 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4505)

Abstract

The HITS algorithm proposed by Kleinberg is one of the representative methods for scoring Web pages by using hyperlinks. When the algorithm was proposed, most of the pages it scored highly were genuinely related to the given topic, and hence the algorithm could be used to find related pages. However, the algorithm and its variants, including BHITS proposed by Bharat and Henzinger, can no longer be used to find related pages on today’s Web because of the increase in spam links. In this paper, we first propose three methods for finding “linkfarms,” that is, sets of spam links forming a densely connected subgraph of a Web graph. We then present an algorithm, called the trust-score algorithm, that gives high scores to pages that are, with high probability, not spam pages. Combining the three methods and the trust-score algorithm with BHITS, we obtain several variants of the HITS algorithm. We ascertain by experiments that one of them, named TaN+BHITS, which uses the trust-score algorithm and the method of finding linkfarms by employing name servers, is the most suitable for finding related pages on today’s Web. Our algorithms require no more time and memory than the original HITS algorithm, and can be executed on a PC with a small amount of main memory.
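To make the problem described in the abstract concrete, the following is a minimal sketch of the textbook HITS iteration only. It is not the paper's TaN+BHITS, linkfarm detection, or trust-score algorithm, none of which are specified in this abstract. The function names (hits, normalize), the toy graphs, and all page names are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return scores
    return {page: v / norm for page, v in scores.items()}


def hits(edges, iterations=50):
    """Textbook HITS on a list of (source, target) hyperlinks.

    Returns two dictionaries: authority scores and hub scores.
    """
    pages = sorted({page for edge in edges for page in edge})
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {page: 0.0 for page in pages}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores of pages linked to.
        new_hub = {page: 0.0 for page in pages}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub pages pointing at topical pages.
clean_edges = [("a", "c"), ("b", "c"), ("a", "d")]

# Hypothetical linkfarm: four spam pages that all link to each other,
# i.e. a densely connected subgraph in the sense of the abstract.
spam_pages = ["spam1", "spam2", "spam3", "spam4"]
farm_edges = [(s, t) for s in spam_pages for t in spam_pages if s != t]

clean_auth, clean_hub = hits(clean_edges)
mixed_auth, mixed_hub = hits(clean_edges + farm_edges)

print("top authority without farm:", max(clean_auth, key=clean_auth.get))
print("top authority with farm:   ", max(mixed_auth, key=mixed_auth.get))
```

On the clean toy graph the page with the most in-links from good hubs ranks first. Once the small linkfarm is added, the mutually reinforcing spam pages take over the top authority positions, which is the kind of failure the abstract attributes to spam links.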

References

  1. Bharat, K., Henzinger, M.R.: Improved algorithms for topic distillation in a hyperlinked environment. In: Proc. 21st ACM SIGIR Conference, pp. 104–111 (1998)

  2. Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P.: Finding authorities and hubs from link structures on the World Wide Web. In: Proc. 10th WWW Conference, pp. 415–429 (2001)

  3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. In: Proc. 7th WWW Conference, pp. 14–18 (1998)

  4. Costa Carvalho, A., Chirita, P., Moura, E., Calado, P., Nejdl, W.: Site level noise removal for search engines. In: Proc. 15th WWW Conference, pp. 73–82 (2006)

  5. Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: Using statistical analysis to locate spam Web pages. In: Proc. 7th International Workshop on the Web and Databases, pp. 1–6 (2004)

  6. Fetterly, D., Manasse, M., Najork, M., Ntoulas, A.: Detecting spam Web pages through content analysis. In: Proc. 15th WWW Conference, pp. 83–92 (2006)

  7. Gyongyi, Z., Garcia-Molina, H., Pedersen, J.: Combating Web spam with TrustRank. In: Proc. 30th VLDB Conference, pp. 576–587 (2004)

  8. Jeh, G., Widom, J.: SimRank: A measure of structural-context similarity. In: Proc. 8th ACM SIGKDD Conference, pp. 538–543 (2002)

  9. Kleinberg, J.: Authoritative sources in a hyperlinked environment. In: Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, pp. 668–677 (1998)

  10. Lempel, R., Moran, S.: The stochastic approach for link-structure analysis (SALSA) and the TKC effect. In: Proc. 9th WWW Conference, pp. 387–401 (2000)

  11. Li, L., Shang, Y., Zhang, W.: Improvement of HITS-based algorithms on Web documents. In: Proc. 11th WWW Conference, pp. 527–535 (2002)

  12. Wang, M.: A significant improvement to Clever algorithm in hyperlinked environment. In: Poster Proc. 11th WWW Conference (2002)

  13. Wang, X., Lu, Z., Zhou, A.: Topic Exploration and Distillation for Web Search by a Similarity-Based Analysis. In: Meng, X., Su, J., Wang, Y. (eds.) WAIM 2002. LNCS, vol. 2419, pp. 316–327. Springer, Heidelberg (2002)

  14. Wu, B., Davison, B.D.: Identifying link farm spam pages. In: Proc. 14th WWW Conference, pp. 820–829 (2005)

  15. Zhang, Y., Yu, J.X., Hou, J.: Web Communities: Analysis and Construction. Springer, Berlin (2006)

Editor information

Guozhu Dong Xuemin Lin Wei Wang Yun Yang Jeffrey Xu Yu

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Asano, Y., Tezuka, Y., Nishizeki, T. (2007). Improvements of HITS Algorithms for Spam Links. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb 2007, WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_50

  • DOI: https://doi.org/10.1007/978-3-540-72524-4_50

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72483-4

  • Online ISBN: 978-3-540-72524-4

  • eBook Packages: Computer Science, Computer Science (R0)
