Abstract
The paper describes the second version of RuBQ, a Russian dataset for knowledge base question answering (KBQA) over Wikidata. Whereas the first version built on Q&A pairs harvested online, the extension is based on questions obtained through search engine query suggestion services. The questions underwent crowdsourced and in-house annotation in a fashion quite different from the first edition. The dataset has doubled in size: RuBQ 2.0 contains 2,910 questions along with their answers and SPARQL queries. For the majority of questions, the dataset also incorporates answer-bearing paragraphs from Wikipedia. The dataset is suitable for the evaluation of KBQA, machine reading comprehension (MRC), hybrid question answering, and semantic parsing. We provide an analysis of the dataset and report several KBQA and MRC baseline results. The dataset is freely available under the CC BY 4.0 license.
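To make the dataset's structure concrete, below is a minimal sketch (not the authors' code or the dataset's exact schema) of how a RuBQ-style record pairs a natural-language question with a Wikidata SPARQL query, and how such a query can be run against the public Wikidata endpoint. The field names, entity/property IDs, and the helper function are illustrative assumptions.

```python
# A hypothetical RuBQ-style record: question text plus a SPARQL query over
# Wikidata. Field names and IDs are for illustration only.
import requests

record = {
    "question_ru": "Кто написал роман «Война и мир»?",
    "question_en": "Who wrote the novel War and Peace?",
    # P50 = "author"; Q161531 = "War and Peace" (IDs shown for illustration)
    "sparql": "SELECT ?answer WHERE { wd:Q161531 wdt:P50 ?answer }",
}

def run_wikidata_query(sparql: str) -> list[str]:
    """Execute a SPARQL query against the public Wikidata endpoint."""
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": sparql, "format": "json"},
        headers={"User-Agent": "rubq-example/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["answer"]["value"] for b in bindings]

print(run_wikidata_query(record["sparql"]))
# Expected to print the entity URI of Leo Tolstoy (wd:Q7243).
```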
Ivan Rybin—work done as an intern at JetBrains Research.
Resource location: https://doi.org/10.5281/zenodo.4345696.
Project page: https://github.com/vladislavneon/RuBQ.
Notes
- 5. In case of multiple answers, majority voting was applied to individual answers, not to the whole list (a minimal aggregation sketch follows these notes).
- 6. Note that prefixes sent to the query suggestion service do not guarantee that the returned question expresses the intended property.
- 7. The answer is four; Google returns an instant answer with a supporting piece of text.
- 8. Our approach resulted in a slightly lower share of simple questions in QSQ compared to WebQuestions: 76% vs. 85%. This can be attributed to the source of questions and the collection process, as well as to structural differences between Freebase and Wikidata.
- 14. Although these regular expressions and the prefixes for collecting QSQs were developed independently and for different sets of properties, this approach can introduce bias into the results.
- 16. https://yandex.ru/dev/mystem/ (in Russian).
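Note 5 states that majority voting was applied per answer rather than to whole answer lists. Below is a minimal sketch of what such per-answer aggregation could look like; the function name and the strict-majority threshold are assumptions for illustration, not the authors' implementation.

```python
# Per-answer majority voting, as described in note 5: each candidate answer
# is accepted or rejected independently, instead of voting on complete lists.
from collections import Counter

def aggregate_answers(worker_answer_sets: list[set[str]]) -> set[str]:
    """Keep every answer endorsed by a strict majority of workers."""
    n_workers = len(worker_answer_sets)
    votes = Counter(ans for answers in worker_answer_sets for ans in answers)
    return {ans for ans, count in votes.items() if count > n_workers / 2}

# Three workers answered a list question; only "Moscow" and "Kazan" reach a
# strict majority (>1.5 votes), so the aggregated answer keeps exactly those.
workers = [{"Moscow", "Kazan"}, {"Moscow"}, {"Moscow", "Kazan", "Tver"}]
print(aggregate_answers(workers))  # {'Moscow', 'Kazan'}
```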
References
Artetxe, M., Ruder, S., Yogatama, D.: On the cross-lingual transferability of monolingual representations. In: ACL, pp. 4623–4637 (2020)
Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: EMNLP, pp. 1533–1544 (2013)
Burtsev, M., et al.: DeepPavlov: open-source library for dialogue systems. In: ACL (System Demonstrations), pp. 122–127 (2018)
Chen, W., et al.: HybridQA: a dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:2004.07347 (2020)
Clark, J.H., et al.: TyDi QA: a benchmark for information-seeking question answering in typologically diverse languages. TACL 8, 454–470 (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
Diefenbach, D., Giménez-García, J., Both, A., Singh, K., Maret, P.: QAnswer KG: designing a portable question answering system over RDF data. In: ESWC, pp. 429–445 (2020)
Duan, N.: Overview of the NLPCC 2019 shared task: open domain semantic parsing. In: Tang, J., Kan, M.-Y., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019. LNCS (LNAI), vol. 11839, pp. 811–817. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32236-6_74
Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: a large dataset for complex question answering over Wikidata and DBpedia. In: ISWC, pp. 69–78 (2019)
Dunn, M., et al.: SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179 (2017)
Efimov, P., Chertok, A., Boytsov, L., Braslavski, P.: SberQuAD – Russian reading comprehension dataset: description and analysis. In: CLEF, pp. 3–15 (2020)
Ferrucci, D., et al.: Building Watson: an overview of the DeepQA project. AI Mag. 31(3), 59–79 (2010)
Grau, B., Ligozat, A.L.: A corpus for hybrid question answering systems. In: Companion Proceedings of The Web Conference 2018, pp. 1081–1086 (2018)
Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: ACL, pp. 1601–1611 (2017)
Korablinov, V., Braslavski, P.: RuBQ: a Russian dataset for question answering over Wikidata. In: ISWC, pp. 97–110 (2020)
Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. TACL 7, 453–466 (2019)
Lewis, P., Oğuz, B., Rinott, R., Riedel, S., Schwenk, H.: MLQA: evaluating cross-lingual extractive question answering. In: ACL, pp. 7315–7330 (2020)
Longpre, S., Lu, Y., Daiber, J.: MKQA: a linguistically diverse benchmark for multilingual open domain question answering. arXiv preprint arXiv:2007.15207 (2020)
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: ACL, pp. 784–789 (2018)
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP, pp. 2383–2392 (2016)
Savenkov, D., Agichtein, E.: When a knowledge base is not enough: question answering over knowledge bases with external text data. In: SIGIR, pp. 235–244 (2016)
Sun, H., Bedrax-Weiss, T., Cohen, W.W.: PullNet: open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019)
Talmor, A., Berant, J.: The web as a knowledge base for answering complex questions. In: NAACL, pp. 641–651 (2018)
Unger, C., et al.: Question answering over linked data (QALD-4). In: Working Notes for CLEF 2014 Conference, pp. 1172–1180 (2014)
Usbeck, R., et al.: 9th challenge on question answering over linked data (QALD-9). In: SemDeep-4, NLIWoD4, and QALD-9 Joint Proceedings, pp. 58–64 (2018)
Yih, W., Richardson, M., Meek, C., Chang, M.W., Suh, J.: The value of semantic parse labeling for knowledge base question answering. In: ACL, pp. 201–206 (2016)
Acknowledgments
We thank Yaroslav Golubev, Dmitry Ustalov, and anonymous reviewers for their valuable comments that helped improve the paper. We are grateful to Toloka for their data annotation grant. PB acknowledges support from the Ministry of Science and Higher Education of the Russian Federation (the project of the development of the regional scientific and educational mathematical center “Ural Mathematical Center”).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Rybin, I., Korablinov, V., Efimov, P., Braslavski, P. (2021). RuBQ 2.0: An Innovated Russian Question Answering Dataset. In: Verborgh, R., et al. (eds.) The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol. 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_32
DOI: https://doi.org/10.1007/978-3-030-77385-4_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77384-7
Online ISBN: 978-3-030-77385-4
eBook Packages: Computer Science (R0)