research-article

Detecting Cross-Lingual Information Gaps in Wikipedia

Author:
Vahid Ashrafimoghari

Stevens Institute of Technology, USA

ORCID: 0000-0002-2687-5513

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023, April 2023, Pages 581–585
https://doi.org/10.1145/3543873.3587539

Published: 30 April 2023

ABSTRACT

An information gap exists across Wikipedia’s language editions, with a considerable proportion of articles available in only a few languages. As an illustration, it has been observed that 10 languages possess half of the available Wikipedia articles, despite the existence of 330 Wikipedia language editions. To address this issue, this study presents an approach to identify the information gap between the different language editions of Wikipedia. The proposed approach employs Latent Dirichlet Allocation (LDA) to analyze linked entities in a cross-lingual knowledge graph in order to determine topic distributions for Wikipedia articles in 28 languages. The distance between paired articles across language editions is then calculated. The potential applications of the proposed algorithm to detecting sources of information disparity in Wikipedia are discussed, and directions for future research are put forward.
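
As a rough illustration of the final step described in the abstract, the sketch below compares the topic distributions of paired articles and ranks the pairs by distance. The abstract does not name the distance measure, so the Jensen-Shannon distance used here is an assumption, as are the function names and the toy topic vectors. In the described approach the distributions would come from an LDA model fitted on linked entities.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (square root of the divergence), in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


def rank_pairs_by_gap(
    pairs: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> List[Tuple[str, float]]:
    """Sort paired articles from largest to smallest topic distance."""
    scored = [(concept, js_distance(a, b)) for concept, (a, b) in pairs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical topic distributions for the same concept in two language editions.
example_pairs = {
    "concept_A": ([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]),  # editions emphasize different topics
    "concept_B": ([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]),  # editions agree
}

for concept, distance in rank_pairs_by_gap(example_pairs):
    print(f"{concept}: {distance:.3f}")
```

Pairs at the top of the ranking are the candidates for an information gap. Any other distance between probability distributions could be substituted for `js_distance` without changing the ranking step.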

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
  2. World Wide Web
    1. Web mining


Published in
WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
April 2023
1567 pages
ISBN:9781450394192
DOI:10.1145/3543873
Editors: Ying Ding, Jie Tang, Juan Sequeda, Lora Aroyo, Carlos Castillo, Geert-Jan Houben
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Computational linguistics
Topic Modeling
Wikipedia
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%