Exploring linguistic features for web spam detection: a preliminary study

Authors:

Jakub Piskorski,

Dawid Weiss

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web

Pages 25 - 28

https://doi.org/10.1145/1451983.1451990

Published: 22 April 2008

Abstract

We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available to other researchers. Preliminary analysis suggests that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.

Index Terms

Exploring linguistic features for web spam detection: a preliminary study
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published In

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web

April 2008, 81 pages

ISBN: 9781605581590

DOI: 10.1145/1451983

Editors: Carlos Castillo (Yahoo! Research), Kumar Chellapilla (Microsoft Live Labs), Dennis Fetterly (Microsoft Research)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

AIRWeb '08: 4th International Workshop on Adversarial Information Retrieval on the Web

April 22, 2008

Beijing, China
