ABSTRACT
Topical classification of user queries is critical for general-purpose web search systems. It is also a challenging task because query terms are sparse and labeled queries are scarce. Meanwhile, the search contexts embedded in query sessions and the unlabeled queries freely available on the web have not been fully utilized in most query classification systems. In this work, we leverage this information to improve query classification accuracy.
We first incorporate search contexts into our framework using a Conditional Random Field (CRF) model. Discriminative training of CRFs is favored over traditional maximum likelihood training because of its robustness to noise. We then adapt self-training to our model to exploit the information in unlabeled queries. By investigating different confidence measurements and model selection strategies, we effectively avoid the error-reinforcing nature of self-training. In extensive experiments on real search logs, we achieve an average improvement of around 20% in classification accuracy over state-of-the-art baselines.
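To make the self-training idea concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a small naive Bayes classifier that labels each query independently, in place of the CRF over query sessions. It uses the posterior probability of the top label as one possible confidence measurement, and held-out validation accuracy as one possible model selection strategy. The function names, the `threshold` value, and `max_rounds` are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit a tiny multinomial naive Bayes model on (query, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for query, label in examples:
        label_counts[label] += 1
        for word in query.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, query):
    """Return (best_label, posterior probability of best_label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for word in query.split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing keeps unseen (label, word) pairs finite.
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    top = max(scores.values())
    exp = {label: math.exp(s - top) for label, s in scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())


def accuracy(model, examples):
    """Fraction of (query, label) pairs the model labels correctly."""
    return sum(predict(model, q)[0] == y for q, y in examples) / len(examples)


def self_train(labeled, unlabeled, validation, threshold=0.8, max_rounds=5):
    """Self-training with a confidence filter and validation-based selection."""
    train_set = list(labeled)
    pool = list(unlabeled)
    model = train(train_set)
    best_model, best_acc = model, accuracy(model, validation)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for query in pool:
            label, conf = predict(model, query)
            (confident if conf >= threshold else remaining).append((query, label))
        if not confident:
            break  # nothing passes the confidence filter
        train_set.extend(confident)  # adopt pseudo-labels as training data
        pool = [q for q, _ in remaining]
        model = train(train_set)
        acc = accuracy(model, validation)
        if acc < best_acc:
            break  # stop rather than reinforce likely errors
        best_model, best_acc = model, acc
    return best_model, best_acc
```

Calling `self_train(labeled, unlabeled, validation)` returns the model with the best validation accuracy seen so far, along with that accuracy. Raising `threshold` admits fewer pseudo-labeled queries per round, which trades coverage of the unlabeled data for a lower risk of training on wrong labels.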
Index Terms
- Improving context-aware query classification via adaptive self-training