Site level noise removal for search engines

Published: 23 May 2006

Abstract

The currently booming search engine industry has led many online organizations to try to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links on the web graph, whether they are spam or simple relationships between real-world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structure, residing at the site level instead of the page level. We thus investigate and remove site-level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after applying our techniques.

    Published In

    WWW '06: Proceedings of the 15th international conference on World Wide Web
    May 2006
    1102 pages
    ISBN:1595933239
    DOI:10.1145/1135777

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. PageRank
    2. link analysis
    3. noise reduction
    4. spam

    Qualifiers

    • Article

    Conference

    WWW06

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
