A generalized hidden Markov model with discriminative training for query spelling correction

Authors:

ChengXiang Zhai

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Pages 611 - 620

https://doi.org/10.1145/2348283.2348365

Published: 12 August 2012

Abstract

Query spelling correction is a crucial component of modern search engines. Existing methods in the literature for search query spelling correction have two major drawbacks. First, they are unable to handle certain important types of spelling errors, such as concatenation and splitting. Second, they cannot efficiently evaluate all the candidate corrections due to the complex form of their scoring functions, and a heuristic filtering step must be applied to select a working set of top-K most promising candidates for final scoring, leading to non-optimal predictions. In this paper we address both limitations and propose a novel generalized Hidden Markov Model with discriminative training that can not only handle all the major types of spelling errors, including splitting and concatenation errors, in a single unified framework, but also efficiently evaluate all the candidate corrections to ensure the finding of a globally optimal correction. Experiments on two query spelling correction datasets demonstrate that the proposed generalized HMM is effective for correcting multiple types of spelling errors. The results also show that it significantly outperforms the current approach for generating top-K candidate corrections, making it a better first-stage filter to enable any other complex spelling correction algorithm to have access to a better working set of candidate corrections as well as to cover splitting and concatenation errors, which no existing method in academic literature can correct.
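
To make the idea of exhaustively scoring candidates and returning a globally optimal correction concrete, here is a minimal sketch of Viterbi decoding over a plain first-order HMM. It is not the paper's generalized HMM: it does not handle splitting or concatenation, it uses no discriminative training, and the candidate lists, probabilities, and function names (`viterbi_correct`, `emit_logp`, `trans_logp`) are all assumptions invented for illustration.

```python
import math


def viterbi_correct(query_words, candidates, emit_logp, trans_logp):
    """Return the highest-scoring sequence of corrections for query_words.

    candidates[w]    -> list of candidate intended words for observed word w
    emit_logp(c, w)  -> log P(observed w | intended c)
    trans_logp(p, c) -> log P(c | previous intended word p); p is None at the start
    """
    first = query_words[0]
    # best[c] = (log score of the best path ending in c, that path)
    best = {c: (trans_logp(None, c) + emit_logp(c, first), [c])
            for c in candidates[first]}
    for w in query_words[1:]:
        new_best = {}
        for c in candidates[w]:
            # Pick the best predecessor state for c (dynamic programming step).
            score, path = max(
                ((s + trans_logp(p, c), prev_path)
                 for p, (s, prev_path) in best.items()),
                key=lambda t: t[0],
            )
            new_best[c] = (score + emit_logp(c, w), path + [c])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


# Toy data (hypothetical): candidate corrections and bigram probabilities.
CANDIDATES = {
    "goverment": ["goverment", "government"],
    "home": ["home", "hone"],
    "page": ["page", "pace"],
}
BIGRAM = {
    (None, "government"): 0.6, (None, "goverment"): 0.01,
    ("government", "home"): 0.5, ("government", "hone"): 0.01,
    ("goverment", "home"): 0.1, ("goverment", "hone"): 0.01,
    ("home", "page"): 0.7, ("home", "pace"): 0.02,
    ("hone", "page"): 0.05, ("hone", "pace"): 0.05,
}


def trans_logp(prev, cur):
    # Unseen bigrams get a small floor probability.
    return math.log(BIGRAM.get((prev, cur), 1e-6))


def emit_logp(intended, observed):
    # Toy error model: typing the intended word unchanged is most likely.
    return math.log(0.9 if intended == observed else 0.1)


result = viterbi_correct(["goverment", "home", "page"],
                         CANDIDATES, emit_logp, trans_logp)
print(result)  # prints ['government', 'home', 'page']
```

Because each step keeps the best path into every candidate state, no top-K pruning is needed in this toy setting, and the returned sequence is optimal under the toy model's scores.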

Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

August 2012

1236 pages

ISBN:9781450314725

DOI:10.1145/2348283

General Chair: William Hersh (Oregon Health & Science University, USA)

Program Chairs: Jamie Callan (Carnegie Mellon University, USA), Yoelle Maarek (Yahoo! Research, Israel), Mark Sanderson (Royal Melbourne Institute of Technology, Australia)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '12

Sponsor: SIGIR

SIGIR '12: The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval

August 12 - 16, 2012

Portland, Oregon, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Total Citations: 22
Total Downloads: 639
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 5

Reflects downloads up to 02 Mar 2025

Cited By

Ye D, Tian B, Fan J, Liu J, Zhou T, Chen X, Li M, Ma J, Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (2023). Improving Query Correction Using Pre-train Language Model In Search Engines. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2999-3008. Online publication date: 21-Oct-2023.
https://dl.acm.org/doi/10.1145/3583780.3614930
Miao Y, Qin J, Hu S, Dong Y, Ishikawa Y, Onizuka M (2020). NGNC: A Flexible and Efficient Framework for Error-Tolerant Query Autocompletion. Software Foundations for Data Interoperability and Large Scale Graph Data Analytics, pages 101-115. Online publication date: 6-Nov-2020.
https://doi.org/10.1007/978-3-030-61133-0_8
Downs B, French T, Wright K, Pera M, Kennington C, Fails J, Fails J (2019). Searching for spellcheckers. Proceedings of the 18th ACM International Conference on Interaction Design and Children, pages 568-573. Online publication date: 12-Jun-2019.
https://dl.acm.org/doi/10.1145/3311927.3325328
Duan J, Ji T, Wu M, Wang H (2019). Query Error Correction Algorithm Based on Fusion Sequence to Sequence Model. Computational Collective Intelligence, pages 13-25. Online publication date: 9-Aug-2019.
https://doi.org/10.1007/978-3-030-28374-2_2
Cormode G, Dasgupta A, Goyal A, Lee C (2018). An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs. PLOS ONE 13:1, e0191175. Online publication date: 18-Jan-2018.
https://doi.org/10.1371/journal.pone.0191175
Sandnes F (2018). Improving the Robustness to Input Errors on Touch-Based Self-service Kiosks and Transportation Apps. Computers Helping People with Special Needs, pages 311-319. Online publication date: 26-Jun-2018.
https://doi.org/10.1007/978-3-319-94277-3_50
Sristy N, Krishna N, Krishna B, Ravi V (2017). Language Identification in Mixed Script. Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation, pages 14-20. Online publication date: 8-Dec-2017.
https://dl.acm.org/doi/10.1145/3158354.3158357
Hagen M, Potthast M, Gohsen M, Rathgeber A, Stein B, Kando N, Sakai T, Joho H, Li H, de Vries A, White R (2017). A Large-Scale Query Spelling Correction Corpus. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1261-1264. Online publication date: 7-Aug-2017.
https://dl.acm.org/doi/10.1145/3077136.3080749
Che Alhadi A, Deraman A, Abdul Jalil M, Wan Yussof W, Mohamed A (2017). An Ensemble Similarity Model for Short Text Retrieval. Computational Science and Its Applications – ICCSA 2017, pages 20-29. Online publication date: 6-Jul-2017.
https://doi.org/10.1007/978-3-319-62392-4_2
Goyal A, Gao J, Deng H, Chang Y, Bennett P, Josifovski V, Neville J, Radlinski F (2016). Query Understanding for Search on All Devices at WSDM 2016. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 691-692. Online publication date: 8-Feb-2016.
https://dl.acm.org/doi/10.1145/2835776.2855115
