Automatic classification of Web queries using very large unlabeled query logs

Published: 01 April 2007

Abstract

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.
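To make the idea of combining classifiers concrete, here is a minimal Python sketch, not the authors' implementation. It pairs an exact lookup in a small table of manually labeled queries with a much-simplified stand-in for selectional-preference mining: for two-term queries "x y" in an unlabeled log where y is already labeled, it counts which category x tends to precede and keeps only the confident rules. The function names, the thresholds `min_count` and `min_prob`, and the toy data are all assumptions made for illustration; the abstract does not specify the scoring measure or how the classifier outputs are merged, so the set union used here is only one plausible choice.

```python
from collections import Counter, defaultdict

def exact_lookup(query, labeled):
    """Return the manually assigned categories for a query, if any."""
    return set(labeled.get(query, ()))

def mine_preferences(log, labeled, min_count=2, min_prob=0.6):
    """Learn 'x -> category' rules from unlabeled two-term queries 'x y'
    whose second term y is itself a labeled query (hypothetical heuristic)."""
    counts = defaultdict(Counter)
    for q in log:
        terms = q.split()
        if len(terms) != 2:
            continue
        x, y = terms
        for cat in labeled.get(y, ()):
            counts[x][cat] += 1
    rules = {}
    for x, cats in counts.items():
        total = sum(cats.values())
        cat, n = cats.most_common(1)[0]
        # Keep a rule only if it is both frequent and dominant for x.
        if n >= min_count and n / total >= min_prob:
            rules[x] = cat
    return rules

def classify(query, labeled, rules):
    """Union of the labels proposed by each component classifier."""
    labels = exact_lookup(query, labeled)
    terms = query.split()
    if len(terms) == 2 and terms[0] in rules:
        labels.add(rules[terms[0]])
    return labels

# Toy data, invented for this example.
labeled = {"chicago": ["places"], "boston": ["places"],
           "paris": ["places"], "ipod": ["shopping"]}
log = ["hotels chicago", "hotels boston", "hotels paris", "buy ipod"]

rules = mine_preferences(log, labeled)
print(rules)                                      # {'hotels': 'places'}
print(classify("hotels denver", labeled, rules))  # {'places'}
print(classify("chicago", labeled, rules))        # {'places'}
```

In this toy run, "hotels denver" never appears in the labeled table, yet the mined rule still assigns it a category. That illustrates how a component trained on unlabeled logs could raise recall beyond exact lookup alone, which is the effect the abstract reports for the combined method.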

Published In

ACM Transactions on Information Systems, Volume 25, Issue 2
April 2007
141 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1229179

Publisher

Association for Computing Machinery
New York, NY, United States
