DOI: 10.1145/3583780.3614930

Improving Query Correction Using Pre-train Language Model In Search Engines

Published: 21 October 2023

Abstract

Query correction is the task of automatically detecting and correcting errors in the queries users type into a search engine. Misspelled queries can lead to user dissatisfaction and churn. However, correcting a user query accurately is not easy: one major challenge is that a correction model must be capable of high-level language comprehension. Recently, pre-trained language models (PLMs) have been successfully applied to text correction tasks, but little work has addressed query correction. Moreover, it is nontrivial to apply these PLMs directly to query correction in large-scale search systems because of two challenges: 1) Expensive deployment: serving such a model requires expensive computation. 2) Lack of domain knowledge: a neural correction model needs massive training data to reach its full potential.
To this end, we introduce KSTEM, a Knowledge-based Sequence To Edit Model for Chinese query correction. KSTEM transforms the sequence generation task into sequence tagging by mapping corrections onto five edit operations: KEEP, REPLACE, SWAP, DELETE, and INSERT, which reduces computational complexity. Additionally, KSTEM adopts a 2D position encoding composed of the internal and external order of the words. To compensate for the lack of domain knowledge, we propose a task-specific training paradigm for query correction, consisting of edit-strategy-based pre-training, user-click-based post-pre-training, and human-label-based fine-tuning. Finally, we deploy KSTEM in an industrial search system. Extensive offline and online experiments show that KSTEM significantly improves query correction performance. We hope that our experience will benefit frontier researchers.
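
The abstract does not spell out KSTEM's exact tag semantics or its 2D position scheme, so the following is a minimal, hypothetical sketch of the two ideas: applying per-character edit tags (KEEP, REPLACE, SWAP, DELETE, INSERT) to reconstruct a corrected query, and deriving (external word order, internal character order) position pairs. The function names, the tag format, and the assumptions that SWAP exchanges a character with its right neighbour and INSERT adds a character after the current position are illustrative, not taken from the paper.

```python
# Hypothetical illustration of an edit-tag formulation; KSTEM's actual
# tag semantics and position encoding may differ.

def apply_edit_tags(chars, tags):
    """Reconstruct a corrected query from per-character edit tags.

    tags[i] is a tuple:
      ("KEEP",)        keep chars[i]
      ("DELETE",)      drop chars[i]
      ("REPLACE", c)   emit c instead of chars[i]
      ("INSERT", c)    keep chars[i] and insert c after it (assumed)
      ("SWAP",)        swap chars[i] with chars[i + 1]     (assumed)
    """
    out, i = [], 0
    while i < len(chars):
        op, *arg = tags[i]
        if op == "KEEP":
            out.append(chars[i])
        elif op == "REPLACE":
            out.append(arg[0])
        elif op == "INSERT":
            out.extend([chars[i], arg[0]])
        elif op == "SWAP":
            out.extend([chars[i + 1], chars[i]])
            i += 1  # the swapped partner has been consumed
        # DELETE emits nothing
        i += 1
    return "".join(out)


def two_d_positions(query_words):
    """(external word index, internal character index) for every character,
    one plausible reading of 'internal and external order of the words'."""
    return [(w_idx, c_idx)
            for w_idx, word in enumerate(query_words)
            for c_idx, _ in enumerate(word)]


if __name__ == "__main__":
    src = list("serach enginee")                          # misspelled query
    tags = ([("KEEP",)] * 2 + [("SWAP",), ("KEEP",)]      # "serach" -> "search"
            + [("KEEP",)] * 9 + [("DELETE",)])            # "enginee" -> "engine"
    print(apply_edit_tags(src, tags))                     # -> "search engine"
    print(two_d_positions(["search", "engine"])[:4])      # -> [(0, 0), (0, 1), (0, 2), (0, 3)]
```

Tagging each input position with one of a handful of edit operations keeps the output length tied to the input length, which is where the computational saving over free-form sequence generation comes from.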

Supplementary Material

MP4 File (full0992-video.mp4)
This video introduces KSTEM, a Knowledge-based Sequence To Edit Model for Chinese query correction, together with its task-specific training paradigm.


Cited By

  • Improving Search Query Accuracy for Specialized Websites Through Intelligent Text Correction and Reconstruction Models. Information 15(11), 683 (2024). DOI: 10.3390/info15110683
  • Span Confusion is All You Need for Chinese Spelling Correction. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4208-4212 (2024). DOI: 10.1145/3627673.3679996
  • Local Attention Augmentation for Chinese Spelling Correction. In Computational Science – ICCS 2024, 438-452 (2024). DOI: 10.1007/978-3-031-63759-9_44


Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN:9798400701245
DOI:10.1145/3583780
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. pre-train language model
  2. query correction
  3. search engines
  4. text tagging

Qualifiers

  • Research-article

Conference

CIKM '23

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

