DOI: 10.1145/2038558.2038573
research-article

Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso

Published: 03 October 2011

Abstract

User generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. On such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. Machine learning techniques hold promise for developing efficient online algorithms that better assist users in vandalism detection. We describe an efficient and accurate classifier that performs vandalism detection in UGC sites, and we report its results on the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features that produces an AUC of 0.9553 on a test dataset, the best result to our knowledge. Using Lasso optimization, we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well: the drop in AUC is only 0.005. We describe how this approach can be generalized to other user generated content systems, and we describe several applications of this classifier that help users identify potential vandalism.
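The reduction step in the abstract (fit a feature-rich model, shrink it with an L1 penalty, then compare AUC before and after) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic (six features, two of them informative), the model is a plain L1-penalised logistic regression fitted by proximal gradient descent, and the names `fit_l1_logistic`, `make_data`, and the penalty `lam=0.1` are hypothetical. It is not the paper's classifier, feature set, or optimizer.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_logistic(X, y, lam, lr=0.5, epochs=200):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalised
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def make_data(n, seed):
    """Synthetic stand-in for labelled edits (assumption, not the PAN corpus)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Features 0 and 1 carry signal; features 2..5 are pure noise.
        row = [rng.gauss(1.5 * label, 1.0), rng.gauss(-1.5 * label, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(4)]
        X.append(row)
        y.append(label)
    return X, y


def predict(w, b, X):
    """Linear scores; AUC depends only on their ranking."""
    return [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]


X_train, y_train = make_data(400, seed=0)
X_test, y_test = make_data(400, seed=1)

w_full, b_full = fit_l1_logistic(X_train, y_train, lam=0.0)      # no penalty
w_sparse, b_sparse = fit_l1_logistic(X_train, y_train, lam=0.1)  # Lasso-style penalty

kept = [j for j, wj in enumerate(w_sparse) if wj != 0.0]
auc_full = auc(predict(w_full, b_full, X_test), y_test)
auc_sparse = auc(predict(w_sparse, b_sparse, X_test), y_test)
print("features kept:", kept)
print("AUC full: %.4f  AUC reduced: %.4f" % (auc_full, auc_sparse))
```

The soft-thresholding step is what sets weights exactly to zero, so the surviving indices in `kept` form the reduced feature set. Comparing the two AUC values on held-out data mirrors, in miniature, the trade-off the abstract reports between model size and ranking quality.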


    Published In

    WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration
    October 2011
    245 pages
    ISBN:9781450309097
    DOI:10.1145/2038558
    • Conference Chair: Felipe Ortega
    • Program Chair: Andrea Forte

    Sponsors

    • TJEF: The John Ernest Foundation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Lasso
    2. Wikipedia
    3. random forests
    4. vandalism detection

    Qualifiers

    • Research-article

    Conference

    WikiSym '11

    Acceptance Rates

    Overall Acceptance Rate 69 of 145 submissions, 48%

    Cited By

    • (2023) Interpretable Classification of Wiki-Review Streams. IEEE Access 11 (141137-141151). DOI: 10.1109/ACCESS.2023.3342472. Online publication date: 2023
    • (2022) A Method for Social Network Extraction From E-Government. Research Anthology on Social Media's Influence on Government, Politics, and Social Movements (224-243). DOI: 10.4018/978-1-6684-7472-3.ch012. Online publication date: 26-Aug-2022
    • (2022) An In-Depth Analysis of Occasional and Recurring Collaborations in Online Music Co-creation. ACM Transactions on Social Computing 4:4 (1-40). DOI: 10.1145/3493800. Online publication date: 29-Jan-2022
    • (2022) A Survey on Detecting Vandalism in Crowdsourcing Models. 2022 International Conference on Data Science and Intelligent Computing (ICDSIC) (25-30). DOI: 10.1109/ICDSIC56987.2022.10076011. Online publication date: 1-Nov-2022
    • (2021) Extracting and Visualizing User Engagement on Wikipedia Talk Pages. Proceedings of the 17th International Symposium on Open Collaboration (1-12). DOI: 10.1145/3479986.3479995. Online publication date: 15-Sep-2021
    • (2020) Social networking mood recognition algorithm for conflict detection and management of Indian educational institutions. Social Network Analysis and Mining 10:1. DOI: 10.1007/s13278-020-00701-3. Online publication date: 1-Nov-2020
    • (2019) A Method for Social Network Extraction From E-Government. International Journal of Information Systems in the Service Sector 11:3 (37-55). DOI: 10.4018/IJISSS.2019070103. Online publication date: 1-Jul-2019
    • (2019) Debiasing Vandalism Detection Models at Wikidata. The World Wide Web Conference (670-680). DOI: 10.1145/3308558.3313507. Online publication date: 13-May-2019
    • (2019) Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia. 2019 Systems and Information Engineering Design Symposium (SIEDS) (1-6). DOI: 10.1109/SIEDS.2019.8735592. Online publication date: Apr-2019
    • (2018) Can Who-Edits-What Predict Edit Survival? Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2604-2613). DOI: 10.1145/3219819.3219979. Online publication date: 19-Jul-2018
