ABSTRACT
An information gap exists across Wikipedia’s language editions: a considerable proportion of articles are available in only a few languages. For example, 10 languages account for half of all Wikipedia articles, even though there are 330 Wikipedia language editions. To address this issue, this study presents an approach for identifying the information gap between language editions of Wikipedia. The approach applies Latent Dirichlet Allocation (LDA) to the linked entities in a cross-lingual knowledge graph to determine topic distributions for Wikipedia articles in 28 languages, and then calculates the distance between paired articles across language editions. The study discusses potential applications of the algorithm to detecting sources of information disparity in Wikipedia and proposes directions for future research.
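The distance step can be illustrated with a minimal sketch. The abstract does not say which distance measure is used, so the Jensen-Shannon distance below is an assumption chosen only because it is a common way to compare probability distributions. The topic distributions are hypothetical hand-written values standing in for LDA output, and the names `js_distance`, `article_en`, and `article_vi` are illustrative, not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance between two topic distributions.

    This is the square root of the Jensen-Shannon divergence. With
    base-2 logarithms the result lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with no shared topic mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


# Hypothetical topic distributions for the same article in two
# language editions (assumed values, not results from the paper).
article_en = [0.70, 0.20, 0.10]
article_vi = [0.10, 0.20, 0.70]

print("same article, same edition:", js_distance(article_en, article_en))
print("paired articles:", js_distance(article_en, article_vi))
```

Under these assumptions, a larger distance for a pair of articles would suggest that the two language editions emphasize different topics for the same subject, which is the kind of signal the abstract describes as an information gap.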