DOI: 10.1145/3543873.3587539
research-article

Detecting Cross-Lingual Information Gaps in Wikipedia

Published: 30 April 2023

ABSTRACT

An information gap exists across Wikipedia’s language editions, with a considerable proportion of articles available in only a few languages. As an illustration, it has been observed that 10 languages possess half of the available Wikipedia articles, despite the existence of 330 Wikipedia language editions. To address this issue, this study presents an approach to identify the information gap between the different language editions of Wikipedia. The proposed approach employs Latent Dirichlet Allocation (LDA) to analyze linked entities in a cross-lingual knowledge graph in order to determine topic distributions for Wikipedia articles in 28 languages. The distance between paired articles across language editions is then calculated. The potential applications of the proposed algorithm to detecting sources of information disparity in Wikipedia are discussed, and directions for future research are put forward.
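The final step described above, computing a distance between the topic distributions of paired articles, can be illustrated with a minimal sketch. The abstract does not name the distance measure, so the Jensen-Shannon distance used here is an assumption, chosen only because it is a common, symmetric and bounded choice for comparing probability distributions. The function names and the three-topic vectors are hypothetical and stand in for the output of an LDA model; they are not taken from the paper.

```python
import math


def normalize(weights):
    """Scale non-negative topic weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base 2).

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no topic in common.
    """
    if len(p) != len(q):
        raise ValueError("both distributions must cover the same topics")
    p, q = normalize(p), normalize(q)
    # The mixture m is positive wherever p or q is, so the KL terms are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


# Hypothetical topic distributions for one article in two language editions.
article_en = [0.70, 0.20, 0.10]
article_vi = [0.10, 0.20, 0.70]
gap = jensen_shannon_distance(article_en, article_vi)
```

Under this assumption, a larger `gap` would suggest that the two editions of the same article emphasise different topics, which is the kind of signal the approach looks for.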


Published in

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
April 2023, 1567 pages
ISBN: 9781450394192
DOI: 10.1145/3543873

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, Research, Refereed limited

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
