DOI: 10.1145/3459637.3482213

Vandalism Detection in OpenStreetMap via User Embeddings

Published: 30 October 2021

Abstract

OpenStreetMap (OSM) is a free and openly editable database of geographic information. Over the years, OSM has evolved into the world's largest open knowledge base of geospatial data, and protecting it from vandalized and falsified information has become paramount to its continued success. However, despite the increasing usage of OSM and wide interest in vandalism detection on open knowledge bases such as Wikipedia and Wikidata, OSM has attracted less attention from the research community, partly due to the lack of a publicly available vandalism corpus. In this paper, we report on the construction of the first OSM vandalism corpus and release it publicly. We describe an approach for creating OSM user embeddings and add the embedding features to a machine learning model to improve vandalism detection in OSM. We validate the model against our vandalism corpus and observe solid improvements in key metrics. The validated model is deployed in production for vandalism detection on Daylight Map.

Supplementary Material

MP4 File (rgsp2320.mp4)


Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN:9781450384469
DOI:10.1145/3459637
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. embeddings
  2. machine learning
  3. openstreetmap
  4. vandalism

Qualifiers

  • Short-paper

Conference

CIKM '21
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2024) How sustainable is OpenStreetMap? Tracking individual trajectories of editing behavior. International Journal of Digital Earth 17:1. DOI: 10.1080/17538947.2024.2311320. Online publication date: 5-Feb-2024
  • (2023) A School of Thought on VGI Challenges: A Literature Review. Papers in Applied Geography 10:1 (53-68). DOI: 10.1080/23754931.2023.2256344. Online publication date: 3-Oct-2023
  • (2023) Mitigating Position Bias in Hotels Recommender Systems. Advances in Bias and Fairness in Information Retrieval (74-84). DOI: 10.1007/978-3-031-37249-0_6. Online publication date: 15-Jul-2023
  • (2022) Attention-Based Vandalism Detection in OpenStreetMap. Proceedings of the ACM Web Conference 2022 (643-651). DOI: 10.1145/3485447.3512224. Online publication date: 25-Apr-2022
