skip to main content
10.1145/3637528.3671654acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Text Matching Indexers in Taobao Search

Published: 24 August 2024 Publication History

Abstract

Product search is an important service on Taobao, the largest e-commerce platform in China. Through this service, users can easily find products relevant to their specific needs. Coping with billion-size query loads, Taobao product search has traditionally relied on classical term-based retrieval models due to their powerful and interpretable indexes. In essence, efficient retrieval hinges on the proper storage of the inverted index. Recent successes involve reducing the size (pruning) of the inverted index but the construction and deployment of lossless static index pruning in practical product search still pose non-trivial challenges.
In this work, we introduce a novel SM art INDexing (SMIND) solution in Taobao product search. SMIND is designed to reduce information loss during the static pruning process by incorporating user search preferences. Specifically, we first construct "user-query-item'' hypergraphs for four different search preferences, namely purchase, click, exposure, and relevance. Then, we develop an efficient TermRank algorithm applied to these hypergraphs, to preserve relevant items based on specific user preferences during the pruning of the inverted indexer. Our approach offers fresh insights into the field of product search, emphasizing that term dependencies in user search preferences go beyond mere text relevance. Moreover, to address the vocabulary mismatch problem inherent in term-based models, we also incorporate an multi-granularity semantic retrieval model to facilitate semantic matching. Empirical results from both offline evaluation and online A/B tests showcase the superiority of SMIND over state-of-the-art methods, especially in commerce metrics with significant improvements of 1.34% in Pay Order Count and 1.50% in Gross Merchandise Value. Besides, SMIND effectively mitigates the Matthew effect of user queries and has been in service for hundreds of millions of daily users since November 2022.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A System for Large-scale Machine Learning. In 12th OSDI. 265--283.
[2]
Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30 (1998), 107--117.
[3]
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient Query Evaluation Using a Two-level Retrieval Process. In 12th CIKM. 426--434.
[4]
David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S Maarek, and Aya Soffer. 2001. Static Index Pruning for Information Retrieval Systems. In 24th SIGIR. 43--50.
[5]
Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, et al. 2021. Extreme Multi-label Learning for Semantic Matching in Product Search. In 27th SIGKDD. 2643--2651.
[6]
Xilun Chen, Kushal Lakhotia, Barlas Oguz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta, and Wen-tau Yih. 2022. Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?. In Findings of the Association for Computational Linguistics: EMNLP 2022. 250--262.
[7]
Dian Cheng, Jiawei Chen, Wenjun Peng, Wenqin Ye, Fuyu Lv, Tao Zhuang, Xiaoyi Zeng, and Xiangnan He. 2022. IHGNN: Interactive Hypergraph Neural Network for Personalized Product Search. In 31st WWW. 256265.
[8]
Zhuyun Dai and Jamie Callan. 2020. Context-Aware Term Weighting For First Stage Passage Retrieval. In 43rd SIGIR. 1533--1536.
[9]
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large Scale Distributed Deep Networks. In 25th NeurIPS. 1223--1231.
[10]
Janez Demsar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR 7 (2006), 1--30.
[11]
Shuai Ding and Torsten Suel. 2011. Faster Top-k Document Retrieval Using Block-max Indexes. In 34th SIGIR. 993--1002.
[12]
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 12, 7 (2011), 2121--2159.
[13]
Lu Fan, Qimai Li, Bo Liu, Xiao-Ming Wu, Xiaotong Zhang, Fuyu Lv, Guli Lin, Sen Li, Taiwei Jin, and Keping Yang. 2022. Modeling User Behavior with Graph Convolution for personalized product search. In 31st WWW. 203--212.
[14]
Marcus Fontoura, Vanja Josifovski, Jinhui Liu, Srihari Venkatesan, Xiangfei Zhu, and Jason Zien. 2011. Evaluation Strategies for Top-k Queries over Memory-resident Inverted Indexes. Proceedings of the VLDB Endowment 4, 12 (2011), 1213--1224.
[15]
Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2022. From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. In 45th SIGIR. 2353--2359.
[16]
Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In 44th SIGIR. 2288--2292.
[17]
Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic Models for the First-stage Retrieval: A Comprehensive Review. ACM Transactions on Information Systems (TOIS) 40, 4 (2022), 1--42.
[18]
Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based Retrieval in Facebook Search. In 26th SIGKDD. 2553--2561.
[19]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Click through Data. In 22nd CIKM. 2333--2338.
[20]
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On Using Very Large Target Vocabulary for Neural Machine Translation. arXiv preprint arXiv:1412.2007 (2014).
[21]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale Similarity Search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535--547.
[22]
Sen Li, Fuyu Lv, Taiwei Jin, Guiyang Li, Yukun Zheng, Tao Zhuang, Qingwen Liu, Xiaoyi Zeng, James Kwok, and Qianli Ma. 2022. Query Rewriting in TaoBao Search. In 31st CIKM. 3262--3271.
[23]
Sen Li, Fuyu Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, and Qianli Ma. 2021. Embedding-Based Product Retrieval in Taobao Search. In 27th SIGKDD. 3181-3189.
[24]
Yiding Liu, Weixue Lu, Suqi Cheng, Daiting Shi, Shuaiqiang Wang, Zhicong Cheng, and Dawei Yin. 2021. Pre-trained Language Model for Web-scale Retrieval in Baidu Search. In 27th SIGKDD. 3365--3375.
[25]
Yiqun Liu, Kaushik Rangadurai, Yunzhong He, Siddarth Malreddy, Xunlong Gui, Xiaoyi Liu, and Fedor Borisyuk. 2021. Que2Search: Fast and Accurate Query and Document Understanding for Search at Facebook. In 27th SIGKDD. 3376--3384.
[26]
Xiaohui Long and Torsten Suel. 2003. Optimized Query Execution in Large Search Engines with Global Page Ordering. In Proceedings 2003 VLDB Conference. Elsevier, 129--140.
[27]
Xusheng Luo, Le Bo, Jinhang Wu, Lin Li, Zhiy Luo, Yonghua Yang, and Keping Yang. 2021. AliCoCo2: Commonsense Knowledge Extraction, Representation and Application in E-commerce. In 27th SIGKDD. 3385--3393.
[28]
Joel Mackenzie, Andrew Trotman, and Jimmy Lin. 2021. Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation. arXiv preprint arXiv:2110.11540 (2021).
[29]
Joel Mackenzie, Andrew Trotman, and Jimmy Lin. 2023. Efficient Document-at-a-Time and Score-at-a-Time Query Evaluation for Learned Sparse Representations. ACM Transactions on Information Systems 41, 4 (2023).
[30]
Alessandro Magnani, Feng Liu, Suthee Chaidaroon, Sachin Yadav, Praveen Reddy Suram, Ajit Puthenputhussery, Sijie Chen, Min Xie, Anirudh Kashi, Tony Lee, et al. 2022. Semantic Retrieval at Walmart. In 28th SIGKDD. 3495--3503.
[31]
Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning Passage Impacts for Inverted Indexes. In 44th SIGIR. 17231727.
[32]
Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Ankit Shingavi, Choon Hui Teo, Hao Gu, and Bing Yin. 2019. Semantic Product Search. In 25th SIGKDD. 2876--2885.
[33]
Xichuan Niu, Bofang Li, Chenliang Li, Rong Xiao, Haochuan Sun, Hongbo Deng, and Zhenzhong Chen. 2020. A Dual Heterogeneous Graph Attention Network to Improve Long-Tail Performance for Shop Search in E-Commerce. In 26th SIGKDD. 3405--3415.
[34]
Alexandros Ntoulas and Junghoo Cho. 2007. Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee. In 30th SIGIR. 191--198.
[35]
Debmalya Panigrahi and Sreenivas Gollapudi. 2013. Document Selection for Tiered Indexing in Commerce Search. In 6th WSDM. 73--82.
[36]
Matjaz Perc. 2014. The Matthew Effect in Empirical Data. Journal of the Royal Society Interface 11, 98 (2014), 20140378-20140378.
[37]
Joseph Rocchio. 1971. Relevance Feedback in Information Retrieval. In The Smart Retrieval System-experiments in Automatic Document Processing. 313--323.
[38]
Cristian Rossi, Edleno S. de Moura, Andre L. Carvalho, and Altigran S. da Silva. 2013. Fast Document-at-a-time Query Processing Using Two-tier Indexes. In 36th SIGIR. 183--192.
[39]
Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Vol. 39. Cambridge University Press.
[40]
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In 23rd WWW. 373--374.
[41]
Trevor Strohman, Howard Turtle, and W. Bruce Croft. 2005. Optimization Strategies for Complex Queries. In 28th SIGIR. 219--225.
[42]
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced Representation Through Knowledge Integration. arXiv preprint arXiv:1904.09223 (2019).
[43]
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[44]
Howard Turtle and James Flood. 1995. Query Evaluation: Strategies and Optimizations. Information Processing & Management 31, 6 (1995), 831--850.
[45]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In 31st NeurIPS, Vol. 30.
[46]
Shaowei Yao, Jiwei Tan, Xi Chen, Keping Yang, Rong Xiao, Hongbo Deng, and Xiaojun Wan. 2021. Learning a Product Relevance Model from Click-Through Data in E-Commerce. In 30th WWW. 2890--2899.
[47]
Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao Wang, Yu Chen, Tamara L Berg, and Ning Zhang. [n. d.]. CommerceMM: Large-scale Commerce Multimodal Representation Learning with Omni Retrieval. In 28th SIGKDD, pages=4433-4442, year=2022.
[48]
Han Zhang, Hongwei Shen, Yiming Qiu, Yunjiang Jiang, Songlin Wang, Sulong Xu, Yun Xiao, Bo Long, and Wen-Yun Yang. 2021. Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index. In 44th SIGIR. ACM, 17181722.
[49]
Han Zhang, Songlin Wang, Kang Zhang, Zhiling Tang, Yunjiang Jiang, Yun Xiao, Weipeng Yan, and Wen-Yun Yang. 2020. Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning. In 43rd SIGIR. 2407--2416.
[50]
Tiancheng Zhao, Xiaopeng Lu, and Kyusong Lee. 2020. SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval. arXiv preprint arXiv:2009.13013 (2020).
[51]
Justin Zobel and Alistair Moffat. 2006. Inverted Files for Text Search Engines. Comput. Surveys 38, 2 (2006), 6-es.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN:9798400704901
DOI:10.1145/3637528
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2024

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. product search
  2. semantic matching
  3. static pruning inverted index

Qualifiers

  • Research-article

Conference

KDD '24
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 440
    Total Downloads
  • Downloads (Last 12 months)440
  • Downloads (Last 6 weeks)20
Reflects downloads up to 20 Jan 2025

Other Metrics

Citations

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media