Abstract
The paper describes the second version of RuBQ, a Russian dataset for knowledge base question answering (KBQA) over Wikidata. Whereas the first version built on Q&A pairs harvested online, the extension is based on questions obtained through search engine query suggestion services. The questions underwent crowdsourced and in-house annotation in a fashion quite different from the first edition. The dataset has doubled in size: RuBQ 2.0 contains 2,910 questions along with their answers and SPARQL queries. For the majority of questions, the dataset also incorporates answer-bearing paragraphs from Wikipedia. The dataset is suitable for the evaluation of KBQA, machine reading comprehension (MRC), hybrid question answering, and semantic parsing. We provide an analysis of the dataset and report several KBQA and MRC baseline results. The dataset is freely available under the CC BY 4.0 license.
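To make the dataset's structure concrete, below is a minimal sketch (not the authors' code or the dataset's exact schema) of how a RuBQ-style record pairs a natural-language question with a Wikidata SPARQL query, and how such a query can be run against the public Wikidata endpoint. The field names, entity/property IDs, and the helper function are illustrative assumptions.

```python
# A hypothetical RuBQ-style record: question text plus a SPARQL query over
# Wikidata. Field names and IDs are for illustration only.
import requests

record = {
    "question_ru": "Кто написал роман «Война и мир»?",
    "question_en": "Who wrote the novel War and Peace?",
    # P50 = "author"; Q161531 = "War and Peace" (IDs shown for illustration)
    "sparql": "SELECT ?answer WHERE { wd:Q161531 wdt:P50 ?answer }",
}

def run_wikidata_query(sparql: str) -> list[str]:
    """Execute a SPARQL query against the public Wikidata endpoint."""
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": sparql, "format": "json"},
        headers={"User-Agent": "rubq-example/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["answer"]["value"] for b in bindings]

print(run_wikidata_query(record["sparql"]))
# Expected to print the entity URI of Leo Tolstoy (wd:Q7243).
```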
Ivan Rybin—work done as an intern at JetBrains Research.
Resource location: https://doi.org/10.5281/zenodo.4345696.
Project page: https://github.com/vladislavneon/RuBQ.
Notes
- 5. In case of multiple answers, majority voting was applied to individual answers, not to the whole list (a minimal aggregation sketch follows these notes).
- 6. Note that prefixes sent to the query suggestion service do not guarantee that the returned question expresses the intended property.
- 7. The answer is four; Google returns an instant answer with a supporting piece of text.
- 8. Our approach resulted in a slightly lower share of simple questions in QSQ compared to WebQuestions: 76% vs. 85%. This can be attributed to the source of questions and the collection process, as well as to structural differences between Freebase and Wikidata.
- 14. Although these regular expressions and the prefixes for collecting QSQs were developed independently and for different sets of properties, this approach can introduce bias into the results.
- 16. https://yandex.ru/dev/mystem/ (in Russian).
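Note 5 states that majority voting was applied per answer rather than to whole answer lists. Below is a minimal sketch of what such per-answer aggregation could look like; the function name and the strict-majority threshold are assumptions for illustration, not the authors' implementation.

```python
# Per-answer majority voting, as described in note 5: each candidate answer
# is accepted or rejected independently, instead of voting on complete lists.
from collections import Counter

def aggregate_answers(worker_answer_sets: list[set[str]]) -> set[str]:
    """Keep every answer endorsed by a strict majority of workers."""
    n_workers = len(worker_answer_sets)
    votes = Counter(ans for answers in worker_answer_sets for ans in answers)
    return {ans for ans, count in votes.items() if count > n_workers / 2}

# Three workers answered a list question; only "Moscow" and "Kazan" reach a
# strict majority (>1.5 votes), so the aggregated answer keeps exactly those.
workers = [{"Moscow", "Kazan"}, {"Moscow"}, {"Moscow", "Kazan", "Tver"}]
print(aggregate_answers(workers))  # {'Moscow', 'Kazan'}
```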
References
Artetxe, M., Ruder, S., Yogatama, D.: On the cross-lingual transferability of monolingual representations. In: ACL, pp. 4623–4637 (2020)
Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: EMNLP, pp. 1533–1544 (2013)
Burtsev, M., et al.: DeepPavlov: open-source library for dialogue systems. In: ACL (System Demonstrations), pp. 122–127 (2018)
Chen, W., et al.: HybridQA: a dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:2004.07347 (2020)
Clark, J.H., et al.: TyDi QA: a benchmark for information-seeking question answering in typologically diverse languages. TACL 8, 454–470 (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
Diefenbach, D., Giménez-García, J., Both, A., Singh, K., Maret, P.: QAnswer KG: designing a portable question answering system over RDF data. In: ESWC, pp. 429–445 (2020)
Duan, N.: Overview of the NLPCC 2019 shared task: open domain semantic parsing. In: Tang, J., Kan, M.-Y., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019. LNCS (LNAI), vol. 11839, pp. 811–817. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32236-6_74
Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: a large dataset for complex question answering over Wikidata and DBpedia. In: ISWC, pp. 69–78 (2019)
Dunn, M., et al.: SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179 (2017)
Efimov, P., Chertok, A., Boytsov, L., Braslavski, P.: SberQuAD – Russian reading comprehension dataset: description and analysis. In: CLEF, pp. 3–15 (2020)
Ferrucci, D., et al.: Building Watson: an overview of the DeepQA project. AI Mag. 31(3), 59–79 (2010)
Grau, B., Ligozat, A.L.: A corpus for hybrid question answering systems. In: Companion Proceedings of The Web Conference 2018, pp. 1081–1086 (2018)
Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: ACL, pp. 1601–1611 (2017)
Korablinov, V., Braslavski, P.: RuBQ: a Russian dataset for question answering over Wikidata. In: ISWC, pp. 97–110 (2020)
Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. TACL 7, 453–466 (2019)
Lewis, P., Oğuz, B., Rinott, R., Riedel, S., Schwenk, H.: MLQA: evaluating cross-lingual extractive question answering. In: ACL, pp. 7315–7330 (2020)
Longpre, S., Lu, Y., Daiber, J.: MKQA: a linguistically diverse benchmark for multilingual open domain question answering. arXiv preprint arXiv:2007.15207 (2020)
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: ACL, pp. 784–789 (2018)
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP, pp. 2383–2392 (2016)
Savenkov, D., Agichtein, E.: When a knowledge base is not enough: question answering over knowledge bases with external text data. In: SIGIR, pp. 235–244 (2016)
Sun, H., Bedrax-Weiss, T., Cohen, W.W.: PullNet: open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019)
Talmor, A., Berant, J.: The web as a knowledge base for answering complex questions. In: NAACL, pp. 641–651 (2018)
Unger, C., et al.: Question answering over linked data (QALD-4). In: Working Notes for CLEF 2014 Conference, pp. 1172–1180 (2014)
Usbeck, R., et al.: 9th challenge on question answering over linked data (QALD-9). In: SemDeep-4, NLIWoD4, and QALD-9 Joint Proceedings, pp. 58–64 (2018)
Yih, W., Richardson, M., Meek, C., Chang, M.W., Suh, J.: The value of semantic parse labeling for knowledge base question answering. In: ACL, pp. 201–206 (2016)
Acknowledgments
We thank Yaroslav Golubev, Dmitry Ustalov, and anonymous reviewers for their valuable comments that helped improve the paper. We are grateful to Toloka for their data annotation grant. PB acknowledges support from the Ministry of Science and Higher Education of the Russian Federation (the project of the development of the regional scientific and educational mathematical center “Ural Mathematical Center”).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Rybin, I., Korablinov, V., Efimov, P., Braslavski, P. (2021). RuBQ 2.0: An Innovated Russian Question Answering Dataset. In: Verborgh, R., et al. (eds.) The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol. 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_32
DOI: https://doi.org/10.1007/978-3-030-77385-4_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77384-7
Online ISBN: 978-3-030-77385-4
eBook Packages: Computer Science (R0)