ABSTRACT
Query spelling correction is a crucial component of modern search engines. Existing methods for search query spelling correction have two major drawbacks. First, they cannot handle certain important types of spelling errors, such as concatenation and splitting. Second, their scoring functions are too complex to evaluate efficiently over all candidate corrections, so a heuristic filtering step must first select a working set of the top-K most promising candidates for final scoring, which leads to non-optimal predictions. In this paper we address both limitations and propose a novel generalized Hidden Markov Model (HMM) with discriminative training. The model handles all the major types of spelling errors, including splitting and concatenation errors, in a single unified framework, and it efficiently evaluates all candidate corrections, which guarantees that a globally optimal correction is found. Experiments on two query spelling correction datasets demonstrate that the proposed generalized HMM is effective for correcting multiple types of spelling errors. The results also show that it significantly outperforms the current approach for generating top-K candidate corrections. This makes it a better first-stage filter: any other complex spelling correction algorithm gains access to a better working set of candidate corrections, as well as coverage of splitting and concatenation errors, which no existing method in the academic literature can correct.
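To make the idea of scoring every candidate without a top-K pre-filter concrete, the following is a minimal sketch and not the model proposed in this paper. It assumes a unigram word prior in place of HMM transition probabilities, a Levenshtein-based error penalty in place of a trained error model, and hand-set penalties in place of discriminative training. The names `correct`, `vocab_logp`, `edit_penalty`, `boundary_penalty`, and `max_span` are hypothetical. A Viterbi-style dynamic program runs over character positions, so in-word errors, concatenation (a missing space), and splitting (a spurious space) are all searched exactly within one recursion.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct(query, vocab_logp, edit_penalty=4.0, boundary_penalty=2.0, max_span=12):
    """Return the highest-scoring word sequence for `query` (toy scoring).

    Assumes `vocab_logp` is a non-empty dict mapping words to log-probabilities.
    All penalty values are illustrative assumptions, not trained weights.
    """
    tokens = query.split()
    chars = "".join(tokens)
    n = len(chars)

    # Character offsets (in the space-free string) where the user typed a space.
    cuts = set()
    pos = 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)

    best = [-math.inf] * (n + 1)   # best[i]: best score explaining chars[:i]
    back = [None] * (n + 1)        # back[i]: (start, word) of the last span
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_span), end):
            if best[start] == -math.inf:
                continue
            span = chars[start:end]
            # Typed spaces swallowed by this span (undoing a splitting error).
            merged = sum(1 for c in cuts if start < c < end)
            # A word boundary the user did not type (undoing a concatenation).
            split_here = 1 if (end < n and end not in cuts) else 0
            for word, logp in vocab_logp.items():
                score = (best[start] + logp
                         - edit_penalty * edit_distance(span, word)
                         - boundary_penalty * (merged + split_here))
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, word)

    # Follow back-pointers from the end of the query to recover the words.
    words = []
    end = n
    while end > 0:
        start, word = back[end]
        words.append(word)
        end = start
    return " ".join(reversed(words))


# Toy vocabulary with made-up log-probabilities (assumption, for illustration).
VOCAB = {"inter": -6.0, "milan": -7.0, "power": -5.0, "point": -5.0,
         "powerpoint": -6.0, "tutorial": -7.0}

if __name__ == "__main__":
    for q in ["intermilan", "power point tutorial", "powerpoint tutoral"]:
        print(q, "->", correct(q, VOCAB))
```

Because the score decomposes over spans, the recursion considers every segmentation and every vocabulary word for each span, so the returned sequence is optimal under this toy scoring function rather than under a pre-filtered candidate set. A faithful implementation would replace the unigram prior with context-dependent transitions and learn the weights from labeled queries.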
Index Terms
- A generalized hidden Markov model with discriminative training for query spelling correction