skip to main content
10.1145/3366424.3383571acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Matching Ukrainian Wikipedia Red Links with English Wikipedia’s Articles

Published: 20 April 2020 Publication History

Abstract

This work tackles the problem of matching Wikipedia red links with existing articles. Links in Wikipedia pages are considered red when lead to nonexistent articles. In other Wikipedia editions could exist articles that correspond to such red links. In our work, we propose a way to match red links in one Wikipedia edition to existent pages in another edition. We define the task as a Named Entity Linking problem because red link titles are mostly named entities. We solve it in a context of Ukrainian red links and English existing pages. We created a dataset of 3171 most frequent Ukrainian red links and a dataset of almost 3 million pairs of red links and the most probable candidates for the correspondent pages in English Wikipedia. This dataset is publicly released1. In this work we define conceptual characteristics of the data — word and graph properties — based on its analysis and exploit these properties in entity resolution. BabelNet knowledge base was applied to this task and was regarded as a baseline for our approach (F1 score = 32 %). To improve the result we introduced several similarity metrics based on mentioned red links characteristics. Combined in a linear model they resulted in F1 score = 85 %. To the best of our knowledge, we are the first to state the problem and propose a solution for red links in Ukrainian Wikipedia edition.

References

[1]
Nuha Alghamdi and Fatmah Assiri. 2019. A Comparison of fastText Implementations Using Arabic Text Classification. In Proceedings of SAI Intelligent Systems Conference. Springer, 306–311.
[2]
Christopher M Bishop. 2006. Pattern recognition and machine learning. Springer.
[3]
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. AcM, 1247–1250.
[4]
MS Fabian, K Gjergji, WEIKUM Gerhard, 2007. Yago: A core of semantic knowledge unifying wordnet and wikipedia. In 16th International World Wide Web Conference, WWW. 697–706.
[5]
Palash Goyal and Emilio Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems 151 (2018), 78–94.
[6]
Dan Jurafsky and James H. Martin. 2019. Speech and Language Processing. In Speech and Language Processing. Third Edition draft.
[7]
Sven Kosub. 2019. A note on the triangle inequality for the jaccard distance. Pattern Recognition Letters 120 (2019), 36–38.
[8]
John M Lee. 2013. Smooth manifolds. In Introduction to Smooth Manifolds. Springer.
[9]
Andrea Moro and Roberto Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015). 288–297.
[10]
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902(2017).
[11]
Wei Shen, Jianyong Wang, and Jiawei Han. 2014. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27, 2(2014), 443–460.
[12]
Ehsan Sherkat and Evangelos E Milios. 2017. Vector embedding of wikipedia concepts and entities. In International conference on applications of natural language to information systems. Springer, 418–428.
[13]
Li Yujian and Liu Bo. 2007. A normalized Levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence 29, 6(2007), 1091–1095.
[14]
Zhicheng Zheng, Fangtao Li, Minlie Huang, and Xiaoyan Zhu. 2010. Learning to link entities with knowledge base. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 483–491.
[15]
Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer. 2016. Robust and collective entity disambiguation through semantic embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 425–434.

Cited By

View all
  • (2022)Towards PageRank Update in a Streaming Graph by Incremental Random WalkIEEE Access10.1109/ACCESS.2022.314929610(15805-15817)Online publication date: 2022

Index Terms

  1. Matching Ukrainian Wikipedia Red Links with English Wikipedia’s Articles
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image ACM Conferences
        WWW '20: Companion Proceedings of the Web Conference 2020
        April 2020
        854 pages
        ISBN:9781450370240
        DOI:10.1145/3366424
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Sponsors

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 20 April 2020

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. Entity linking
        2. Red links
        3. Wikipedia

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        WWW '20
        Sponsor:
        WWW '20: The Web Conference 2020
        April 20 - 24, 2020
        Taipei, Taiwan

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)2
        • Downloads (Last 6 weeks)0
        Reflects downloads up to 02 Mar 2025

        Other Metrics

        Citations

        Cited By

        View all
        • (2022)Towards PageRank Update in a Streaming Graph by Incremental Random WalkIEEE Access10.1109/ACCESS.2022.314929610(15805-15817)Online publication date: 2022

        View Options

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format.

        HTML Format

        Figures

        Tables

        Media

        Share

        Share

        Share this Publication link

        Share on social media