DOI:10.1145/1451983.1451990
research-article

Exploring linguistic features for web spam detection: a preliminary study

Published: 22 April 2008

Abstract

We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available to other researchers. Preliminary analysis indicates that certain linguistic features may be useful for spam detection when combined with features studied elsewhere.

Published In

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web
April 2008
81 pages
ISBN:9781605581590
DOI:10.1145/1451983

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content features
  2. linguistic features
  3. web spam
  4. web spam detection

Qualifiers

  • Research-article

Conference

AIRWeb'08

Cited By

  • (2023) PRADA: Practical Black-box Adversarial Attacks against Neural Ranking Models. ACM Transactions on Information Systems 41:4 (1-27). DOI:10.1145/3576923. Online publication date: 8-Apr-2023
  • (2023) Machine-Learning-Based Spam Mail Detector. SN Computer Science 4:6. DOI:10.1007/s42979-023-02330-x. Online publication date: 8-Nov-2023
  • (2022) Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models. Proceedings of the 31st ACM International Conference on Information & Knowledge Management (2128-2137). DOI:10.1145/3511808.3557256. Online publication date: 17-Oct-2022
  • (2022) Towards Forecasting Internet Financial Frauds based on Advertising. 2022 8th International Conference on Big Data and Information Analytics (BigDIA) (5-11). DOI:10.1109/BigDIA56350.2022.9874049. Online publication date: 24-Aug-2022
  • (2022) Exaggeration in fake vs. authentic online reviews for luxury and budget hotels. International Journal of Information Management 62 (102416). DOI:10.1016/j.ijinfomgt.2021.102416. Online publication date: Feb-2022
  • (2021) Hantaran Bahasa Melayu yang Tular di Facebook: Analisis dari Aspek Kandungan dan Atribut Linguistik. Jurnal Bahasa 21:2 (217-240). DOI:10.37052/jb21(2)no3. Online publication date: 5-Dec-2021
  • (2021) An Improved Framework for Content- and Link-Based Web-Spam Detection. Complexity 2021. DOI:10.1155/2021/6625739. Online publication date: 1-Jan-2021
  • (2020) Detecting Web Spam Based on Novel Features from Web Page Source Code. Security and Communication Networks 2020. DOI:10.1155/2020/6662166. Online publication date: 17-Dec-2020
  • (2019) Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase. Security and Communication Networks 2019. DOI:10.1155/2019/6587020. Online publication date: 20-Feb-2019
  • (2018) Web Crawling and Processing with Limited Resources for Business Intelligence and Analytics Applications. Journal of Software 13:5 (300-316). DOI:10.17706/jsw.13.5.300-316. Online publication date: May-2018
