Abstract
Semantic search has emerged as a way to address the challenges of traditional keyword-based retrieval systems, such as the vocabulary gap between the query and document spaces. In this paper, we propose a novel semantic retrieval framework that uses semantic entity linking systems to form a graph representation of documents and queries, where nodes represent concepts extracted from documents and edges represent semantic relatedness between those concepts. The core of our proposed work is a semantic-enabled language model that estimates the probability of generating query concepts given values assigned to document concepts. The semantic retrieval framework also provides a basis for interpolating keyword-based retrieval systems with the semantic-enabled language model. We conduct comprehensive experiments over several TREC document collections and analyze the performance of different configurations of the framework across multiple retrieval measures. Our experimental results show that the proposed semantic retrieval model has a synergistic impact on the results obtained by state-of-the-art keyword-based systems, and that semantic information can complement and enhance the performance of such retrieval models.
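The interpolation of a keyword-based system with a semantic model can be illustrated with a minimal sketch. The abstract does not state the interpolation formula, so the code below assumes a simple linear combination; the function names `interpolate_scores` and `rank`, the weight `lam`, and the choice to score a missing document as 0.0 are all hypothetical and are not taken from the paper. Both score maps are assumed to be already normalized to comparable ranges.

```python
from typing import Dict, List


def interpolate_scores(
    keyword_scores: Dict[str, float],
    semantic_scores: Dict[str, float],
    lam: float = 0.5,
) -> Dict[str, float]:
    """Linearly combine two per-document score maps.

    Assumption (not from the paper): a document missing from one
    map contributes 0.0 for that component.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    doc_ids = set(keyword_scores) | set(semantic_scores)
    return {
        doc_id: lam * semantic_scores.get(doc_id, 0.0)
        + (1.0 - lam) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def rank(scores: Dict[str, float]) -> List[str]:
    """Return document ids by descending score, ties broken by id."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    keyword = {"d1": 0.8, "d2": 0.4}
    semantic = {"d2": 0.9, "d3": 0.5}
    combined = interpolate_scores(keyword, semantic, lam=0.5)
    print(rank(combined))
```

With `lam = 0` the ranking falls back to the keyword-based system alone, and with `lam = 1` it uses only the semantic scores, so the weight controls how much the semantic evidence is allowed to complement the keyword evidence.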
Notes
While recognizing the differences, we use relatedness and similarity interchangeably in this paper.
P(d) is used in some retrieval methods for modeling document-specific criteria such as authority.
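As a worked illustration of where P(d) enters, the standard query-likelihood formulation applies Bayes' rule to the probability of a document d given a query q. This is the textbook derivation and is offered only as context; it is not necessarily the exact model used in this paper.

```latex
P(d \mid q) \;=\; \frac{P(q \mid d)\, P(d)}{P(q)} \;\propto\; P(q \mid d)\, P(d)
```

Because P(q) is the same for every document, it does not affect the ranking. When P(d) is assumed uniform, ranking by P(d | q) reduces to ranking by P(q | d).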
Acknowledgements
This work is partially funded by Ferdowsi University of Mashhad Grant Number 2/39715.
Cite this article
Ensan, F., Du, W. Ad hoc retrieval via entity linking and semantic similarity. Knowl Inf Syst 58, 551–583 (2019). https://doi.org/10.1007/s10115-018-1190-1
DOI: https://doi.org/10.1007/s10115-018-1190-1