
WikiDetect: Automatic Vandalism Detection for Wikipedia Using Linguistic Features

  • Conference paper
Computational Collective Intelligence. Technologies and Applications (ICCCI 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8083)

Abstract

Content vandalism has always been one of Wikipedia's greatest problems, yet only a few fully automatic solutions have been developed so far. Volunteers still spend large amounts of time correcting vandalized page edits instead of using that time to improve the quality of articles. This paper introduces a new vandalism detection system that uses only natural language processing and machine learning techniques. The system has been evaluated on a corpus of real vandalized data in order to test its performance and justify the design choices. The same expert-annotated wikitext, extracted from the encyclopedia’s database, is used to evaluate different vandalism detection algorithms. The paper presents a critical analysis of the obtained results, comparing them to existing solutions, and suggests different statistical classification methods that bring several improvements to the task at hand.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cioiu, D., Rebedea, T. (2013). WikiDetect: Automatic Vandalism Detection for Wikipedia Using Linguistic Features. In: Bădică, C., Nguyen, N.T., Brezovan, M. (eds) Computational Collective Intelligence. Technologies and Applications. ICCCI 2013. Lecture Notes in Computer Science, vol 8083. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40495-5_32

  • DOI: https://doi.org/10.1007/978-3-642-40495-5_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40494-8

  • Online ISBN: 978-3-642-40495-5

  • eBook Packages: Computer Science, Computer Science (R0)
