Ad hoc retrieval via entity linking and semantic similarity

  • Regular Paper
Knowledge and Information Systems

Abstract

Semantic search has emerged as a possible way to address the challenges of traditional keyword-based retrieval systems, such as the vocabulary gap between the query and document spaces. In this paper, we propose a novel semantic retrieval framework that uses semantic entity linking systems to form a graph representation of documents and queries, where nodes represent concepts extracted from documents and edges represent semantic relatedness between those concepts. The core of our proposed work is a semantic-enabled language model that estimates the probability of generating query concepts given values assigned to document concepts. The semantic retrieval framework also provides a basis for interpolating keyword-based retrieval systems with the semantic-enabled language model. We conduct comprehensive experiments over several TREC document collections and analyze the performance of different configurations of the framework across multiple retrieval measures. Our experimental results show that the proposed semantic retrieval model has a synergistic impact on the results obtained through state-of-the-art keyword-based systems, and that the consideration of semantic information can complement and enhance the performance of such retrieval models.
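
To make the idea of interpolating a keyword-based system with a semantic model more concrete, the following is a minimal sketch that assumes a simple linear interpolation of per-document scores. The abstract does not specify the interpolation scheme, so the function names (`interpolate_scores`, `rank`), the weight `lam`, the treatment of missing documents as score 0.0, and the toy scores are all illustrative assumptions rather than the authors' method.

```python
from typing import Dict, List


def interpolate_scores(keyword_scores: Dict[str, float],
                       semantic_scores: Dict[str, float],
                       lam: float = 0.5) -> Dict[str, float]:
    """Linearly combine two per-document score maps.

    lam weights the semantic score; (1 - lam) weights the keyword score.
    Assumption: a document missing from one map contributes 0.0 from it.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    doc_ids = set(keyword_scores) | set(semantic_scores)
    return {
        d: lam * semantic_scores.get(d, 0.0)
        + (1.0 - lam) * keyword_scores.get(d, 0.0)
        for d in doc_ids
    }


def rank(scores: Dict[str, float]) -> List[str]:
    """Return document ids by descending score, ties broken by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical scores for three documents and one query.
keyword = {"d1": 0.8, "d2": 0.4}
semantic = {"d2": 0.9, "d3": 0.6}
combined = interpolate_scores(keyword, semantic, lam=0.5)
ranking = rank(combined)
```

In this toy setting, a document that is only moderately scored by the keyword system but strongly scored by the semantic model can overtake a keyword-only match, which is the kind of complementary effect the abstract describes. In practice the two score sets would need to be on comparable scales before being combined.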

Notes

  1. While recognizing the differences, we use relatedness and similarity interchangeably in this paper.

  2. P(d) is used in some retrieval methods for modeling document-specific criteria such as authority.

  3. http://lucene.apache.org/.

  4. https://github.com/SemanticLM/SELM.

Acknowledgements

This work is partially funded by Ferdowsi University of Mashhad Grant Number 2/39715.

Author information

Corresponding author

Correspondence to Faezeh Ensan.

About this article

Cite this article

Ensan, F., Du, W. Ad hoc retrieval via entity linking and semantic similarity. Knowl Inf Syst 58, 551–583 (2019). https://doi.org/10.1007/s10115-018-1190-1

  • DOI: https://doi.org/10.1007/s10115-018-1190-1
