Before the advent of the World Wide Web, information retrieval algorithms were developed for relatively small and coherent document collections, such as newspaper articles or the book catalogs of a library. In comparison to these collections, the Web is massive, far less coherent, changes more rapidly, and is spread over geographically distributed computers. Scaling information retrieval algorithms to the World Wide Web is a challenging task, and success to date is reflected in the ubiquitous use of search engines to access Internet content.
Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving, and ranking information from collections in which a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is "search engine spamming" or spamdexing: malicious attempts to influence the outcome of ranking algorithms in order to obtain an undeservedly high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, since a favorable position in search engine result pages is strongly correlated with more traffic, which often translates to more revenue.
As in previous years, automatic detection of search engine spam was the dominant theme of this workshop. A significant fraction of the accepted papers used temporal information to aid in the detection of adversarial behavior. In addition to the short and long papers accepted in previous years, this year we introduced a new category: position papers on challenges in Adversarial Information Retrieval. We were pleased to accept two papers in that category, as we believe they have the potential to stimulate discussion at the workshop and beyond.
Proceeding Downloads
Looking into the past to better classify web spam
Web spamming techniques aim to achieve undeserved rankings in search results. Extensive research has addressed identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue ...
A study of link farm distribution and evolution using a time series of web snapshots
In this paper, we study the overall link-based spam structure and its evolution, which would be helpful for developing robust analysis tools and for studying Web spamming as a social activity in cyberspace. First, we use strongly connected ...
Web spam filtering in internet archives
While Web spam targets the high commercial value of top-ranked search-engine results, Web archives suffer quality deterioration and resource waste as a side effect. So far, Web spam filtering technologies are rarely used by Web archivists, but ...
Web spam identification through language model analysis
This paper applies a language model approach to different sources of information extracted from a Web page, in order to provide high quality indicators in the detection of Web Spam. Two pages linked by a hyperlink should be topically related, even ...
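The language-model idea sketched in this abstract can be illustrated with a toy example (not the authors' implementation): build smoothed unigram language models for a page and a page it links to, and treat a large divergence between them as a signal that the link is topically unrelated. The pages, smoothing, and divergence choice below are illustrative assumptions.

```python
from collections import Counter
import math

def unigram_lm(tokens, vocab, alpha=1.0):
    # Laplace-smoothed unigram language model over a shared vocabulary.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    # KL(P || Q); both models are smoothed, so no zero probabilities.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical source page and the page it links to.
source = "cheap loans credit cheap loans approval".split()
target = "holiday photos from our family trip to the lake".split()
vocab = set(source) | set(target)

divergence = kl_divergence(unigram_lm(source, vocab),
                           unigram_lm(target, vocab))
# A large divergence suggests the two linked pages are topically
# unrelated, one possible indicator of a nepotistic or spam link.
```

In practice such divergences would be computed over many sources of page text (anchor text, titles, content) and combined with other features, as the paper investigates.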
An empirical study on selective sampling in active learning for splog detection
This paper studies how to reduce the amount of human supervision for identifying splogs / authentic blogs in the context of continuously updating splog data sets year by year. Following previous work on active learning, against the task of splog / ...
Linked latent Dirichlet allocation in web spam filtering
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply an extension of LDA for web spam classification. Our linked LDA ...
Social spam detection
The popularity of social bookmarking sites has made them prime targets for spammers. Many of these systems require an administrator's time and energy to manually filter or remove spam. Here we discuss the motivations of social spam, and present a study ...
Tag spam creates large non-giant connected components
Spammers in social bookmarking systems try to mimic the bookmarking behaviour of real users to gain the attention of other users or search engines. Several methods have been proposed for the detection of such spam, including domain-specific features (like ...
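The structural observation in the title above can be checked with a simple connected-components pass over a user graph. A minimal union-find sketch, where the graph and node names are illustrative assumptions (e.g., users connected if they share a bookmarked URL), not data from the paper:

```python
def connected_components(edges):
    # Union-find over an undirected graph given as (u, v) edge pairs;
    # returns component sizes, largest first.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for u, v in edges:
        union(u, v)

    comps = {}
    for node in parent:
        comps.setdefault(find(node), set()).add(node)
    return sorted((len(c) for c in comps.values()), reverse=True)

# Toy user-user graph: genuine users tend to merge into one large
# component, while spammer accounts form several mid-size ones.
edges = [("u1", "u2"), ("u2", "u3"),   # genuine users
         ("s1", "s2"), ("s3", "s4")]   # spammer accounts
print(connected_components(edges))  # prints [3, 2, 2]
```

On a real tagging system the interesting signal would be the size distribution of the non-giant components, which the paper examines.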
Nullification test collections for web spam and SEO
Research in the area of adversarial information retrieval has been facilitated by the availability of the UK-2006/UK-2007 collections, comprising crawl data, link graph, and spam labels. However, research into nullifying the negative effect of spam or ...
Web spam challenge proposal for filtering in archives
In this paper we propose new tasks for a possible future Web Spam Challenge motivated by the needs of the archival community. The Web archival community consists of several relatively small institutions that operate independently and possibly over ...
Cited By
- Papadopoulos S, Bontcheva K, Jaho E, Lupu M and Castillo C (2016). Overview of the Special Issue on Trust and Veracity of Information in Social Media, ACM Transactions on Information Systems, 10.1145/2870630, 34:3, (1-5), Online publication date: 11-Apr-2016.
- Daroczy B, Siklois D, Palovics R and Benczur A Text Classification Kernels for Quality Prediction over the C3 Data Set Proceedings of the 24th International Conference on World Wide Web, (1441-1446)
Recommendations
Fourth international workshop on adversarial information retrieval on the web (AIRWeb 2008)
WWW '08: Proceedings of the 17th international conference on World Wide Web
Adversarial IR in general, and search engine spam in particular, are engaging research topics with real-world impact for Web users, advertisers, and publishers. The AIRWeb workshop will bring researchers and practitioners in these areas together, to ...