skip to main content
10.1145/3485447.3512073acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Learning Probabilistic Box Embeddings for Effective and Efficient Ranking

Published: 25 April 2022 Publication History

Abstract

Ranking has been one of the most important tasks in information retrieval. With the development of deep representation learning, many researchers propose to encode both the query and items into embedding vectors and rank the items according to the inner product or distance measures in the embedding space. However, the ranking models based on vector embeddings may have shortages in effectiveness and efficiency. For effectiveness, they lack the intrinsic ability to model the diversity and uncertainty of queries and items in ranking. For efficiency, nearest neighbor search in a large collection of item vectors can be costly. In this work, we propose to use the recently proposed probabilistic box embeddings for effective and efficient ranking, in which queries and items are parameterized as high-dimensional axis-aligned hyper-rectangles. For effectiveness, we utilize probabilistic box embeddings to model the diversity and uncertainty with the overlapping relations of the hyper-rectangles, and prove that such overlapping measure is a kernel function which can be adopted in other kernel-based methods. For efficiency, we propose a box embedding-based indexing method, which can safely filter irrelevant items and reduce the retrieval latency. We further design a training strategy to increase the proportion of irrelevant items that can be filtered by the index. Experiments on public datasets show that the box embeddings and the box embedding-based indexing approaches are effective and efficient in two ranking tasks: ad hoc retrieval and product recommendation.

References

[1]
Lars Arge, Mark De Berg, Herman Haverkort, and Ke Yi. 2008. The priority R-tree: A practically efficient and worst-case optimal R-tree. ACM Transactions on Algorithms (TALG) 4, 1 (2008), 1–30.
[2]
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data. 322–331.
[3]
Stefan Berchtold, Daniel A Keim, and Hans-Peter Kriegel. 1996. The X-tree: An index structure for high-dimensional data. In Very Large Data-Bases. 28–39.
[4]
Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020. Overview of the trec 2019 deep learning track. arXiv preprint arXiv:2003.07820(2020).
[5]
Zhuyun Dai and Jamie Callan. 2019. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687(2019).
[6]
Shib Sankar Dasgupta, Michael Boratko, Dongxu Zhang, Luke Vilnis, Xiang Lorraine Li, and Andrew McCallum. 2020. Improving Local Identifiability in Probabilistic Box Embeddings. arXiv preprint arXiv:2010.04831(2020).
[7]
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. 253–262.
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805(2018).
[9]
Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning. PMLR, 1646–1655.
[10]
Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. arXiv preprint arXiv:2104.07186(2021).
[11]
Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. 2020. Complementing lexical retrieval with semantic residual embedding. arXiv preprint arXiv:2004.13969(2020).
[12]
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2946–2953.
[13]
Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data. 47–57.
[14]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939(2015).
[15]
Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2553–2561.
[16]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338.
[17]
Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33, 1(2010), 117–128.
[18]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734(2017).
[19]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data(2019).
[20]
Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[21]
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906(2020).
[22]
Alice Lai and Julia Hockenmaier. 2017. Learning to predict denotational probabilities for modeling entailment. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 721–730.
[23]
Xiang Li, Luke Vilnis, Dongxu Zhang, Michael Boratko, and Andrew McCallum. 2018. Smoothing the geometry of probabilistic box embeddings. In International Conference on Learning Representations.
[24]
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101(2017).
[25]
Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence 42, 4(2018), 824–836.
[26]
Lang Mei, Jun He, Hongyan Liu, and Xiaoyong Du. 2019. Latent path connected space model for recommendation. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data. Springer, 163–172.
[27]
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@ NIPS.
[28]
Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. Advances in neural information processing systems 30 (2017), 6338–6347.
[29]
Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085(2019).
[30]
Hongyu Ren, Weihua Hu, and Jure Leskovec. 2020. Query2box: Reasoning over knowledge graphs in vector space using box embeddings. arXiv preprint arXiv:2002.05969(2020).
[31]
Stephen E Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR’94. Springer, 232–241.
[32]
Malcolm Slaney and Michael Casey. 2008. Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal processing magazine 25, 2 (2008), 128–131.
[33]
Sandeep Subramanian and Soumen Chakrabarti. 2018. New embedded representations and evaluation protocols for inferring transitive relations. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1037–1040.
[34]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[35]
Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361(2015).
[36]
Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. arXiv preprint arXiv:1805.06627(2018).
[37]
Luke Vilnis and Andrew McCallum. 2014. Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623(2014).
[38]
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808(2020).
[39]
Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality (JDIQ) 10, 4 (2018), 1–20.
[40]
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Optimizing Dense Retrieval Model Training with Hard Negatives. arXiv preprint arXiv:2104.08051(2021).
[41]
Shuai Zhang, Huoyu Liu, Aston Zhang, Yue Hu, Ce Zhang, Yumeng Li, Tanchao Zhu, Shaojian He, and Wenwu Ou. 2021. Learning User Representations with Hypercuboids for Recommender Systems. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 716–724.
[42]
Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Kaiyuan Li, Yushuo Chen, Yujie Lu, Hui Wang, Changxin Tian, Xingyu Pan, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. 2020. RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms. arXiv preprint arXiv:2011.01731(2020).

Cited By

View all
  • (2024)Enhancing Recommendation Accuracy and Diversity with Box Embedding: A Universal FrameworkProceedings of the ACM Web Conference 202410.1145/3589334.3645577(3756-3766)Online publication date: 13-May-2024
  • (2024)Resource2Box: Learning To Rank Resources in Distributed Search Using Box Embedding2024 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM59182.2024.00017(101-110)Online publication date: 9-Dec-2024
  • (2024)Optimizing Probabilistic Box Embeddings with Distance Measures2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00106(5088-5100)Online publication date: 13-May-2024
  • Show More Cited By

Index Terms

  1. Learning Probabilistic Box Embeddings for Effective and Efficient Ranking
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      WWW '22: Proceedings of the ACM Web Conference 2022
      April 2022
      3764 pages
      ISBN:9781450390965
      DOI:10.1145/3485447
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 April 2022

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Dense Retrieval
      2. Embeddings
      3. Indexing
      4. Probabilistic Box Embedding
      5. Ranking
      6. Recommendation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Natural Science Foundation of China

      Conference

      WWW '22
      Sponsor:
      WWW '22: The ACM Web Conference 2022
      April 25 - 29, 2022
      Virtual Event, Lyon, France

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)57
      • Downloads (Last 6 weeks)8
      Reflects downloads up to 08 Mar 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Enhancing Recommendation Accuracy and Diversity with Box Embedding: A Universal FrameworkProceedings of the ACM Web Conference 202410.1145/3589334.3645577(3756-3766)Online publication date: 13-May-2024
      • (2024)Resource2Box: Learning To Rank Resources in Distributed Search Using Box Embedding2024 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM59182.2024.00017(101-110)Online publication date: 9-Dec-2024
      • (2024)Optimizing Probabilistic Box Embeddings with Distance Measures2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00106(5088-5100)Online publication date: 13-May-2024
      • (2023)Improving First-stage Retrieval of Point-of-interest Search by Pre-training ModelsACM Transactions on Information Systems10.1145/363193742:3(1-27)Online publication date: 29-Dec-2023
      • (2023)Contrastive Box Embedding for Collaborative ReasoningProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591654(38-47)Online publication date: 19-Jul-2023

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Figures

      Tables

      Media

      Share

      Share

      Share this Publication link

      Share on social media