
RuBQ 2.0: An Innovated Russian Question Answering Dataset

  • Conference paper

The Semantic Web (ESWC 2021)

Part of the book series: Lecture Notes in Computer Science, vol. 12731

Abstract

The paper describes the second version of RuBQ, a Russian dataset for knowledge base question answering (KBQA) over Wikidata. Whereas the first version builds on Q&A pairs harvested online, the extension is based on questions obtained through search engine query suggestion services. The questions underwent crowdsourced and in-house annotation in a fashion quite different from the first edition. The dataset has doubled in size: RuBQ 2.0 contains 2,910 questions along with their answers and SPARQL queries. The dataset also incorporates answer-bearing paragraphs from Wikipedia for the majority of questions. The dataset is suitable for the evaluation of KBQA, machine reading comprehension (MRC), hybrid question answering, and semantic parsing. We provide an analysis of the dataset and report several KBQA and MRC baseline results. The dataset is freely available under the CC BY 4.0 license.

Ivan Rybin—work done as an intern at JetBrains Research.

Resource location: https://doi.org/10.5281/zenodo.4345696.

Project page: https://github.com/vladislavneon/RuBQ.
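
Since every question in RuBQ 2.0 is paired with a SPARQL query over Wikidata, a natural first step with the resource is to execute the gold queries against the public Wikidata endpoint and compare their results with a system's predicted answers. Below is a minimal sketch of that workflow in Python. The record field names (question_text, query) and the file name are assumptions made for illustration; consult the project page above for the actual schema.

    import json
    import requests

    WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

    def run_sparql(query):
        # Execute a SPARQL query against the public Wikidata endpoint and
        # collect all bound values from the result bindings.
        resp = requests.get(
            WDQS_ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "rubq-eval-sketch/0.1"},
            timeout=60,
        )
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        return {value["value"] for row in rows for value in row.values()}

    # Hypothetical record layout; see the repository for the real schema.
    with open("RuBQ_2.0_test.json", encoding="utf-8") as f:
        dataset = json.load(f)

    item = dataset[0]
    gold = run_sparql(item["query"])  # answer set entailed by the gold query
    print(item["question_text"], "->", gold)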



Notes

  1. https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers.

  2. https://github.com/vladislavneon/RuBQ.

  3. https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all.

  4. https://toloka.ai/.

  5. In case of multiple answers, majority voting was applied to individual answers, not to the whole list (see the aggregation sketch after these notes).

  6. Note that prefixes sent to the query suggestion service do not guarantee that the returned question expresses the intended property.

  7. The answer is four; Google returns an instant answer with a supporting piece of text.

  8. Our approach resulted in a slightly lower share of simple questions in QSQ compared to WebQuestions: 76% vs. 85%. This can be attributed to the source of questions and the collection process, as well as to structural differences between Freebase and Wikidata.

  9. https://ru.wikipedia.org/w/api.php.

  10. http://docs.deeppavlov.ai/en/master/features/models/kbqa.html.

  11. https://qanswer-frontend.univ-st-etienne.fr/.

  12. https://github.com/vladislavneon/kbqa-tools/rubq-baseline.

  13. http://docs.deeppavlov.ai/en/master/features/models/syntaxparser.html.

  14. Although these regular expressions and the prefixes for collecting QSQs were developed independently and for different sets of properties, this approach can introduce bias into the results.

  15. https://huggingface.co/bert-base-multilingual-cased (see the MRC inference sketch after these notes).

  16. https://yandex.ru/dev/mystem/ (in Russian).
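
Note 5 describes per-answer aggregation for questions with multiple answers: crowd votes are counted for each individual answer rather than for the whole answer list. The page does not include an implementation, so the following is only a plausible sketch under the assumption that an answer is accepted when more than half of the annotators listed it.

    from collections import Counter

    def aggregate_answers(annotations, threshold=0.5):
        # Per-answer majority voting (note 5): each candidate answer is
        # accepted or rejected on its own, instead of requiring annotators
        # to agree on the entire answer list.
        # annotations: one list of answers per annotator, same question.
        votes = Counter(ans for answer_list in annotations
                        for ans in set(answer_list))
        n = len(annotations)
        return [ans for ans, count in votes.items() if count / n > threshold]

    # Three annotators answering "Which countries border Mongolia?":
    # Kazakhstan gets only one vote out of three and is dropped.
    print(aggregate_answers([
        ["Russia", "China"],
        ["Russia", "China", "Kazakhstan"],
        ["Russia", "China"],
    ]))  # -> ['Russia', 'China']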
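
The MRC baselines mentioned in the abstract rely on multilingual BERT (note 15) to extract answer spans from the answer-bearing Wikipedia paragraphs. Below is a hedged inference sketch using the Hugging Face pipeline API; the checkpoint path is a placeholder, since bert-base-multilingual-cased has no question-answering head until it is fine-tuned (e.g., on SQuAD-style data).

    from transformers import pipeline

    # Placeholder checkpoint: substitute an mBERT model fine-tuned for
    # extractive QA; the base bert-base-multilingual-cased has no QA head.
    qa = pipeline("question-answering", model="path/to/mbert-finetuned-qa")

    # A RuBQ-style (question, paragraph) pair, translated to English here.
    result = qa(
        question="Who wrote the novel War and Peace?",
        context=("War and Peace is a novel by the Russian author Leo "
                 "Tolstoy, first published serially in the 1860s."),
    )
    print(result["answer"], result["score"])  # extracted span + confidence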

References

  1. Artetxe, M., Ruder, S., Yogatama, D.: On the cross-lingual transferability of monolingual representations. In: ACL, pp. 4623–4637 (2020)

  2. Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: EMNLP, pp. 1533–1544 (2013)

  3. Burtsev, M., et al.: DeepPavlov: open-source library for dialogue systems. In: ACL (System Demonstrations), pp. 122–127 (2018)

  4. Chen, W., et al.: HybridQA: a dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:2004.07347 (2020)

  5. Clark, J.H., et al.: TyDi QA: a benchmark for information-seeking question answering in typologically diverse languages. TACL 8, 454–470 (2020)

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)

  7. Diefenbach, D., Giménez-García, J., Both, A., Singh, K., Maret, P.: QAnswer KG: designing a portable question answering system over RDF data. In: ESWC, pp. 429–445 (2020)

  8. Duan, N.: Overview of the NLPCC 2019 shared task: open domain semantic parsing. In: Tang, J., Kan, M.-Y., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019. LNCS (LNAI), vol. 11839, pp. 811–817. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32236-6_74

  9. Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: a large dataset for complex question answering over Wikidata and DBpedia. In: ISWC, pp. 69–78 (2019)

  10. Dunn, M., et al.: SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179 (2017)

  11. Efimov, P., Chertok, A., Boytsov, L., Braslavski, P.: SberQuAD - Russian reading comprehension dataset: description and analysis. In: CLEF, pp. 3–15 (2020)

  12. Ferrucci, D., et al.: Building Watson: an overview of the DeepQA project. AI Mag. 31(3), 59–79 (2010)

  13. Grau, B., Ligozat, A.L.: A corpus for hybrid question answering systems. In: Companion Proceedings of The Web Conference 2018, pp. 1081–1086 (2018)

  14. Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: ACL, pp. 1601–1611 (2017)

  15. Korablinov, V., Braslavski, P.: RuBQ: a Russian dataset for question answering over Wikidata. In: ISWC, pp. 97–110 (2020)

  16. Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. TACL 7, 453–466 (2019)

  17. Lewis, P., Oğuz, B., Rinott, R., Riedel, S., Schwenk, H.: MLQA: evaluating cross-lingual extractive question answering. In: ACL, pp. 7315–7330 (2020)

  18. Longpre, S., Lu, Y., Daiber, J.: MKQA: a linguistically diverse benchmark for multilingual open domain question answering. arXiv preprint arXiv:2007.15207 (2020)

  19. Rajpurkar, P., Jia, R., Liang, P.: Know what you don't know: unanswerable questions for SQuAD. In: ACL, pp. 784–789 (2018)

  20. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP, pp. 2383–2392 (2016)

  21. Savenkov, D., Agichtein, E.: When a knowledge base is not enough: question answering over knowledge bases with external text data. In: SIGIR, pp. 235–244 (2016)

  22. Sun, H., Bedrax-Weiss, T., Cohen, W.W.: PullNet: open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019)

  23. Talmor, A., Berant, J.: The web as a knowledge base for answering complex questions. In: NAACL, pp. 641–651 (2018)

  24. Unger, C., et al.: Question answering over linked data (QALD-4). In: Working Notes for CLEF 2014 Conference, pp. 1172–1180 (2014)

  25. Usbeck, R., et al.: 9th challenge on question answering over linked data (QALD-9). In: SemDeep-4, NLIWoD4, and QALD-9 Joint Proceedings, pp. 58–64 (2018)

  26. Yih, W., Richardson, M., Meek, C., Chang, M.W., Suh, J.: The value of semantic parse labeling for knowledge base question answering. In: ACL, pp. 201–206 (2016)


Acknowledgments

We thank Yaroslav Golubev, Dmitry Ustalov, and anonymous reviewers for their valuable comments that helped improve the paper. We are grateful to Toloka for their data annotation grant. PB acknowledges support from the Ministry of Science and Higher Education of the Russian Federation (the project of the development of the regional scientific and educational mathematical center “Ural Mathematical Center”).


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Rybin, I., Korablinov, V., Efimov, P., Braslavski, P. (2021). RuBQ 2.0: An Innovated Russian Question Answering Dataset. In: Verborgh, R., et al. The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol. 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_32


  • DOI: https://doi.org/10.1007/978-3-030-77385-4_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77384-7

  • Online ISBN: 978-3-030-77385-4

  • eBook Packages: Computer Science, Computer Science (R0)
