Before the advent of the World Wide Web, information retrieval algorithms were developed for relatively small and coherent document collections such as newspaper articles or book catalogs in a library. In comparison to these collections, the Web is massive, much less coherent, changes more rapidly, and is spread over geographically distributed computers. Scaling information retrieval algorithms to the World Wide Web is a challenging task. Success to date is evidenced by the ubiquitous use of search engines to access Internet content.
From the point of view of a search engine, the Web is a mix of two types of content: the "closed Web" and the "open Web". The closed Web comprises a few high-quality controlled collections which a search engine can fully trust. The open Web, on the other hand, includes the vast majority of Web pages, which lack an authority asserting their quality. The openness of the Web has been the key to its rapid growth and success. However, this openness is also a major source of new challenges for information retrieval methods.
Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is "search engine spamming" or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, since a good ranking is strongly correlated with more traffic, which often translates to more revenue.
Proceeding Downloads
A large-scale study of automated web search traffic
As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third party systems interact with search engines for a variety of reasons, such as monitoring a ...
Identifying web spam with user behavior analysis
Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific known types of Web spam and are incapable and inefficient for newly-appeared spam. With user ...
Query-log mining for detecting spam
Every day millions of users search for information on the web via search engines, and provide implicit feedback on the results shown for their queries by clicking or not clicking on them. This feedback is encoded in the form of a query log that consists of a ...
Cleaning search results using term distance features
The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated ...
Exploring linguistic features for web spam detection: a preliminary study
We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007; we make them publicly available for other researchers. Preliminary analysis seems ...
Latent dirichlet allocation in web spam filtering
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique for web ...
Analysing features of Japanese splogs and characteristics of keywords
- Yuuki Sato,
- Takehito Utsuro,
- Yoshiaki Murakami,
- Tomohiro Fukuhara,
- Hiroshi Nakagawa,
- Yasuhide Kawada,
- Noriko Kando
This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in ...
Web spam identification through content and hyperlinks
We present an algorithm, witch, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and ...
Identifying video spammers in online social networks
In many video social networks, including YouTube, users are permitted to post video responses to other users' videos. Such a response can be legitimate or can be a video response spam, which is a video response whose content is not related to the topic ...
A few bad votes too many?: towards robust ranking in social media
Online social media draws heavily on active reader participation, such as voting or rating of news stories, articles, or responses to a question. This user feedback is invaluable for ranking, filtering, and retrieving high quality content - tasks that ...
The anti-social tagger: detecting spam in social bookmarking systems
The annotation of web sites in social bookmarking systems has become a popular way to manage and find information on the web. The community structure of such systems attracts spammers: recent post pages, popular pages or specific tag pages can be ...
Robust PageRank and locally computable spam detection features
Since the link structure of the web is an important element in ranking systems on search engines, web spammers widely use the link structure of the web to increase the rank of their pages. Various link-based features of web pages have been introduced ...
Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Recommendations
Fourth international workshop on adversarial information retrieval on the web (AIRWeb 2008)
WWW '08: Proceedings of the 17th international conference on World Wide Web
Adversarial IR in general, and search engine spam in particular, are engaging research topics with a real-world impact for Web users, advertisers and publishers. The AIRWeb workshop will bring researchers and practitioners in these areas together, to ...