Dense Re-Ranking with Weak Supervision for RDF Dataset Search

Chen, Qiaosheng; Huang, Zixian; Zhang, Zhiyang; Luo, Weiqing; Lin, Tengteng; Shi, Qing; Cheng, Gong

doi:10.1007/978-3-031-47240-4_2

Gong Cheng ORCID: orcid.org/0000-0003-3539-7776

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14265)

Included in the following conference series:

International Semantic Web Conference


Abstract

Dataset search aims to find datasets that are relevant to a keyword query. Existing dataset search engines rely on conventional sparse retrieval models (e.g., BM25). Dense models (e.g., BERT-based) remain under-investigated for two reasons: the limited availability of labeled data for fine-tuning such a deep neural model, and its limited input capacity relative to the large size of a dataset. To fill the gap, in this paper, we study dense re-ranking for RDF dataset search. Our re-ranking model encodes the metadata of RDF datasets and also their actual RDF data—by extracting a small yet representative subset of data to accommodate large datasets. To address the insufficiency of training data, we adopt a coarse-to-fine tuning strategy where we warm up the model with weak supervision from a large set of automatically generated queries and relevance labels. Experiments on the ACORDAR test collection demonstrate the effectiveness of our approach, which considerably improves the retrieval accuracy of existing sparse models.
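
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the representative subset is extracted or how the dense model scores a dataset, so everything here is an assumption made for illustration. The names `sample_triples`, `to_text`, `overlap_score`, and `rerank` are hypothetical, the predicate-coverage heuristic is a stand-in for the paper's subset extraction, the term-overlap scorer is a stand-in for a sparse model such as BM25, and the second-stage scorer is passed in as a plain function where a fine-tuned BERT-based model would go.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Scorer = Callable[[str, str], float]


def sample_triples(triples: List[Triple], budget: int) -> List[Triple]:
    """Pick at most `budget` triples, covering distinct predicates first.

    Hypothetical heuristic, standing in for the paper's subset extraction.
    """
    chosen: List[Triple] = []
    seen_predicates = set()
    # First pass: keep one triple per distinct predicate, in input order.
    for t in triples:
        if len(chosen) >= budget:
            return chosen
        if t[1] not in seen_predicates:
            seen_predicates.add(t[1])
            chosen.append(t)
    # Second pass: fill any remaining budget with the leftover triples.
    for t in triples:
        if len(chosen) >= budget:
            break
        if t not in chosen:
            chosen.append(t)
    return chosen


def to_text(metadata: str, triples: List[Triple]) -> str:
    """Flatten metadata and sampled triples into one string for scoring."""
    return metadata + " " + " ".join(" ".join(t) for t in triples)


def overlap_score(query: str, text: str) -> float:
    """Toy first-stage scorer (term overlap), a stand-in for BM25."""
    q = Counter(query.lower().split())
    d = Counter(text.lower().split())
    return float(sum(min(q[w], d[w]) for w in q))


def rerank(
    query: str,
    datasets: Dict[str, Tuple[str, List[Triple]]],
    first_stage: Scorer,
    second_stage: Scorer,
    k: int,
    budget: int,
) -> List[str]:
    """Retrieve the top-k datasets with `first_stage`, then reorder them
    with `second_stage` (where a dense model would be plugged in)."""
    texts = {
        name: to_text(meta, sample_triples(triples, budget))
        for name, (meta, triples) in datasets.items()
    }
    top_k = sorted(texts, key=lambda n: first_stage(query, texts[n]), reverse=True)[:k]
    return sorted(top_k, key=lambda n: second_stage(query, texts[n]), reverse=True)
```

In this sketch the triple budget plays the role of the model's limited input capacity: only the sampled subset, not the whole dataset, reaches either scorer. Only the `k` candidates returned by the first stage are passed to the second-stage scorer, which is what keeps a costly model affordable as a re-ranker.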



Acknowledgements

This work was supported by the NSFC (62072224).

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Qiaosheng Chen, Zixian Huang, Zhiyang Zhang, Weiqing Luo, Tengteng Lin, Qing Shi & Gong Cheng


Corresponding author

Correspondence to Gong Cheng.

Editor information

Editors and Affiliations

University of Liverpool, Liverpool, UK
Terry R. Payne
University of Bologna, Bologna, Italy
Valentina Presutti
Southeast University, Nanjing, China
Guilin Qi
Universidad Politécnica de Madrid, Madrid, Spain
María Poveda-Villalón
Huawei Technologies R&D UK, Edinburgh, UK
Giorgos Stoilos
Centrum Wiskunde and Informatica, Amsterdam, The Netherlands
Laura Hollink
IT University of Copenhagen, Copenhagen, Denmark
Zoi Kaoudi
Nanjing University, Nanjing, China
Gong Cheng
Tsinghua University, Beijing, Beijing, China
Juanzi Li


About this paper

Cite this paper

Chen, Q. et al. (2023). Dense Re-Ranking with Weak Supervision for RDF Dataset Search. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14265. Springer, Cham. https://doi.org/10.1007/978-3-031-47240-4_2


DOI: https://doi.org/10.1007/978-3-031-47240-4_2
Published: 27 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47239-8
Online ISBN: 978-3-031-47240-4
eBook Packages: Computer Science, Computer Science (R0)
