Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso

Authors:

Sara Javanmardi,

David W. McDonald,

Cristina V. Lopes

WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration

Pages 82 - 90

https://doi.org/10.1145/2038558.2038573

Published: 03 October 2011

Abstract

User-generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. On such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. Machine learning techniques hold promise for developing efficient online algorithms that power better tools to assist users in vandalism detection. We describe an efficient and accurate classifier that performs vandalism detection in UGC sites, and we report its results on the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features, which produces an AUC of 0.9553 on a test dataset, the best result to our knowledge. Using Lasso optimization, we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well: the drop in AUC is only 0.005. We describe how this approach can be generalized to other user-generated content systems and present several applications of this classifier that help users identify potential vandalism.
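
The Lasso-based reduction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify their features, classifier, or penalty setting, so everything below is an assumption made for illustration. The sketch uses synthetic toy data with six hypothetical features (only two of which influence the label), an L1-penalized least-squares fit by cyclic coordinate descent, and a rank-based AUC. The names `lasso_fit`, `make_data`, `predict`, and the penalty value `lam=0.15` are hypothetical.

```python
import random

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_fit(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes its own term.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= delta * X[i][j]
            w[j] = new_w
    return w

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def make_data(rng, n, p=6):
    """Hypothetical toy data: only features 0 and 1 influence the label."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1 if row[0] - row[1] + 0.3 * rng.gauss(0.0, 1.0) > 0 else 0 for row in X]
    return X, y

def predict(X, w):
    """Linear score for each row; higher means more likely to be the positive class."""
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]

rng = random.Random(0)
X_train, y_train = make_data(rng, 400)
X_test, y_test = make_data(rng, 200)

y_mean = sum(y_train) / len(y_train)
y_centered = [v - y_mean for v in y_train]  # centring stands in for an intercept

w_full = lasso_fit(X_train, y_centered, lam=0.0)    # no penalty: keeps every feature
w_lasso = lasso_fit(X_train, y_centered, lam=0.15)  # L1 penalty zeroes weak features
selected = [j for j, wj in enumerate(w_lasso) if wj != 0.0]

auc_full = auc(predict(X_test, w_full), y_test)
auc_reduced = auc(predict(X_test, w_lasso), y_test)
print("selected features:", selected)
print("AUC full: %.4f  AUC reduced: %.4f" % (auc_full, auc_reduced))
```

The point of the sketch is the trade-off the abstract reports: the L1 penalty drives the coefficients of uninformative features to exactly zero, so the reduced model uses fewer features while its held-out AUC stays close to that of the unpenalized model. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.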

Index Terms

1. Information systems
  1. Information systems applications

Published In

WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration

October 2011

245 pages

ISBN: 9781450309097

DOI: 10.1145/2038558

Conference Chair: Felipe Ortega, University Rey Juan Carlos, Madrid, Spain
Program Chair: Andrea Forte, Drexel University, Philadelphia

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

TJEF: The John Ernest Foundation

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Office of Cyberinfrastructure

Conference

WikiSym '11: The 7th International Symposium on Wikis and Open Collaboration

October 3 - 5, 2011

Mountain View, California

Acceptance Rates

Overall Acceptance Rate 69 of 145 submissions, 48%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 27
Total Downloads: 250
Downloads (last 12 months): 4
Downloads (last 6 weeks): 1

Reflects downloads up to 20 Feb 2025

Citations

Cited By

García-Méndez S, Leal F, Malheiro B, Burguillo-Rial J (2023). Interpretable Classification of Wiki-Review Streams. IEEE Access 11, 141137-141151. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3342472
Alguliyev R, Aliguliyev R, Niftaliyeva (Iskandarli) G (2022). A Method for Social Network Extraction From E-Government. Research Anthology on Social Media's Influence on Government, Politics, and Social Movements, 224-243. Online publication date: 26-Aug-2022.
https://doi.org/10.4018/978-1-6684-7472-3.ch012
Calefato F, Iaffaldano G, Trisolini L, Lanubile F (2022). An In-Depth Analysis of Occasional and Recurring Collaborations in Online Music Co-creation. ACM Transactions on Social Computing 4:4, 1-40. Online publication date: 29-Jan-2022.
https://dl.acm.org/doi/10.1145/3493800
Mohammed Ali A, Alwan H, Al-Shakarchy N (2022). A Survey on Detecting Vandalism in Crowdsourcing Models. 2022 International Conference on Data Science and Intelligent Computing (ICDSIC), 25-30. Online publication date: 1-Nov-2022.
https://doi.org/10.1109/ICDSIC56987.2022.10076011
MacKenzie C, Hott J (2021). Extracting and Visualizing User Engagement on Wikipedia Talk Pages. Proceedings of the 17th International Symposium on Open Collaboration, 1-12. Online publication date: 15-Sep-2021.
https://dl.acm.org/doi/10.1145/3479986.3479995
Sengupta S, Vaish A (2020). Social networking mood recognition algorithm for conflict detection and management of Indian educational institutions. Social Network Analysis and Mining 10:1. Online publication date: 1-Nov-2020.
https://doi.org/10.1007/s13278-020-00701-3
Alguliyev R, Aliguliyev R, Niftaliyeva (Iskandarli) G (2019). A Method for Social Network Extraction From E-Government. International Journal of Information Systems in the Service Sector 11:3, 37-55. Online publication date: 1-Jul-2019.
https://doi.org/10.4018/IJISSS.2019070103
Heindorf S, Scholten Y, Engels G, Potthast M (2019). Debiasing Vandalism Detection Models at Wikidata. The World Wide Web Conference, 670-680. Online publication date: 13-May-2019.
https://dl.acm.org/doi/10.1145/3308558.3313507
Rawat C, Sarkar A, Singh S, Alvarado R, Rasberry L (2019). Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia. 2019 Systems and Information Engineering Design Symposium (SIEDS), 1-6. Online publication date: Apr-2019.
https://doi.org/10.1109/SIEDS.2019.8735592
Yardim A, Kristof V, Maystre L, Grossglauser M, Guo Y, Farooq F (2018). Can Who-Edits-What Predict Edit Survival? Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2604-2613. Online publication date: 19-Jul-2018.
https://dl.acm.org/doi/10.1145/3219819.3219979
