Automatic classification of Web queries using very large unlabeled query logs

Authors:

Steven M. Beitzel,

Eric C. Jensen,

David D. Lewis,

Abdur Chowdhury,

Ophir Frieder

ACM Transactions on Information Systems (TOIS), Volume 25, Issue 2

Pages 9 - es

https://doi.org/10.1145/1229179.1229183

Published: 01 April 2007

Abstract

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.
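As a rough illustration of two of the components named in the abstract, the sketch below pairs exact lookup with n-gram partial matching against a small table of manually labeled queries. The table contents, the function names, and the rule of taking the union of both outputs are assumptions made for illustration; the abstract does not say how the authors combine their classifiers, and the linear models and the selectional-preference miner are not shown.

```python
from typing import Dict, List, Set

# Hypothetical hand-labeled frequent queries (illustrative data, not from the article).
LABELED: Dict[str, Set[str]] = {
    "new york yankees": {"Sports"},
    "cheap flights": {"Travel"},
    "yankees tickets": {"Sports", "Shopping"},
}


def ngrams(tokens: List[str], max_n: int) -> List[str]:
    """Return all contiguous word n-grams of length 1..max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def exact_lookup(query: str) -> Set[str]:
    """Categories of the query if it appears verbatim in the labeled table."""
    return set(LABELED.get(query.lower().strip(), set()))


def partial_match(query: str, max_n: int = 3) -> Set[str]:
    """Categories of every labeled query that occurs as an n-gram of the query."""
    tokens = query.lower().split()
    found: Set[str] = set()
    for gram in ngrams(tokens, max_n):
        found |= LABELED.get(gram, set())
    return found


def classify(query: str) -> Set[str]:
    """Union of the component outputs (the union rule is an assumption of this sketch)."""
    return exact_lookup(query) | partial_match(query)
```

In this toy setting, a query such as "cheap flights to boston" is missed by exact lookup but still receives a category through partial matching. That is the kind of recall gain a combined method aims for, at some possible cost in precision.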

Index Terms

Automatic classification of Web queries using very large unlabeled query logs
1. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services

Published In

ACM Transactions on Information Systems Volume 25, Issue 2

April 2007

141 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1229179

Copyright © 2007 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Article Metrics

Total Citations: 66

Total Downloads: 2,033

Downloads (Last 12 months): 8

Downloads (Last 6 weeks): 1

Reflects downloads up to 17 Jan 2025

Cited By

Agrawal S, Kadam K, Mehta J, Hole V (2024) Minimizing Web Diversion Using Query Classification and Text Mining. Data Intelligence and Cognitive Informatics, 151-165. Online publication date: 7-Jan-2024
https://doi.org/10.1007/978-981-99-7962-2_12
Zhang J, Li X, Wang L (2023) A Review Selection Method Based on Consumer Decision Phases in E-commerce. ACM Transactions on Information Systems 42:1, 1-27. Online publication date: 21-Aug-2023
https://dl.acm.org/doi/10.1145/3587265
Chai C (2022) Comparison of text preprocessing methods. Natural Language Engineering, 1-45. Online publication date: 13-Jun-2022
https://doi.org/10.1017/S1351324922000213
Mele I, Muntean C, Nardini F, Perego R, Tonellotto N, Frieder O (2021) Adaptive utterance rewriting for conversational search. Information Processing and Management: an International Journal 58:6. Online publication date: 1-Nov-2021
https://dl.acm.org/doi/10.1016/j.ipm.2021.102682
Balaji B, Balakrishnan S, Venkatachalam K, Jeyakrishnan V (2020) RETRACTED ARTICLE: Automated query classification based web service similarity technique using machine learning. Journal of Ambient Intelligence and Humanized Computing 12:6, 6169-6180. Online publication date: 15-Jun-2020
https://doi.org/10.1007/s12652-020-02186-6
Hamroun M, Gouider M (2020) A survey on intention analysis: successful approaches and open challenges. Journal of Intelligent Information Systems 55:3, 423-443. Online publication date: 1-Dec-2020
https://dl.acm.org/doi/10.1007/s10844-020-00604-x
Guo J, Lan Y (2020) Query Classification. Query Understanding for Search Engines, 15-41. Online publication date: 2-Dec-2020
https://doi.org/10.1007/978-3-030-58334-7_2
Kim J (2019) A Document Ranking Method With Query-Related Web Context. IEEE Access 7, 150168-150174. Online publication date: 2019
https://doi.org/10.1109/ACCESS.2019.2947166
Chardonnens A, Rizza E, Coeckelbergs M, van Hooland S (2018) Mining user queries with information extraction methods and linked data. Journal of Documentation 74:5, 936-950. Online publication date: 10-Sep-2018
https://doi.org/10.1108/JD-09-2017-0133
Vrochidis S, Patras I, Kompatsiaris I (2017) Gaze movement-driven random forests for query clustering in automatic video annotation. Multimedia Tools and Applications 76:2, 2861-2889. Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.5555/3048787.3048837