DOI: 10.1145/1935826.1935928
poster

An algorithmic treatment of strong queries

Published: 09 February 2011

Abstract

A strong query for a target document with respect to an index is the smallest query for which the index returns the target document as the top result. The strong query problem was first studied more than a decade ago in the context of measuring search engine overlap. Although the problem is simple to state and has long been known in the field, it has not been adequately addressed in a formal manner.
In this paper we provide the first rigorous treatment of the strong query problem. We show an interesting connection between this problem and the set cover problem, and use it to obtain basic hardness and algorithmic results. Experiments on more than 10K documents show that our proposed algorithm performs much better than the widely used word frequency-based heuristic. En route, our study suggests that fewer than four words on average can be sufficient to uniquely identify web pages.
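
The set cover connection mentioned above can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a simplified Boolean AND index with no ranking, so a document matches a query when it contains every query term, and "top result" reduces to "only result". Under that assumption, each term of the target document "covers" the other documents that lack it, and a strong query is a smallest set of terms covering all other documents. The function names (`greedy_strong_query`, `rarest_terms_query`, `matching_docs`) and the toy corpus are hypothetical, and the rarest-terms baseline is only a stand-in for a word frequency-based heuristic.

```python
from typing import Dict, List, Set


def matching_docs(query: List[str], others: Dict[str, Set[str]]) -> List[str]:
    """Ids of the other documents that contain every query term (AND semantics)."""
    return sorted(d for d, terms in others.items() if all(t in terms for t in query))


def greedy_strong_query(target: Set[str], others: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover: pick target terms until no other document matches."""
    remaining = set(others)  # other documents that still match the query so far
    query: List[str] = []
    while remaining:
        best_term, best_covered = None, set()
        # Sorted iteration makes tie-breaking deterministic.
        for term in sorted(target - set(query)):
            # A term covers the remaining documents that do NOT contain it.
            covered = {d for d in remaining if term not in others[d]}
            if len(covered) > len(best_covered):
                best_term, best_covered = term, covered
        if best_term is None:
            # Some other document contains every term of the target.
            raise ValueError("cannot separate target from %s" % sorted(remaining))
        query.append(best_term)
        remaining -= best_covered
    return query


def rarest_terms_query(target: Set[str], others: Dict[str, Set[str]], k: int) -> List[str]:
    """Baseline (assumed): the k target terms with the lowest document frequency."""
    def doc_freq(term: str) -> int:
        return sum(term in terms for terms in others.values())
    return sorted(target, key=lambda t: (doc_freq(t), t))[:k]


# Toy corpus (hypothetical): the two rarest target terms co-occur in d1.
target = {"strong", "query", "greedy", "cover"}
others = {
    "d1": {"greedy", "cover"},
    "d2": {"strong", "query"},
    "d3": {"strong", "query", "index"},
}

print(greedy_strong_query(target, others))    # ['cover', 'query']
print(rarest_terms_query(target, others, 2))  # ['cover', 'greedy']
print(matching_docs(["cover", "greedy"], others))  # ['d1']
```

In this toy corpus the two rarest terms both occur in `d1`, so the rarest-terms query still matches another document, while the greedy choice picks a second term that eliminates `d1`. A real search engine ranks results rather than filtering by conjunction, so this sketch only conveys the covering intuition.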

Cited By

  • (2023) Improving Content Retrievability in Search with Controllable Query Generation. Proceedings of the ACM Web Conference 2023, pages 3182-3192. DOI: 10.1145/3543507.3583261. Online publication date: 30-Apr-2023
  • (2021) Strong natural language query generation. Information Retrieval Journal. DOI: 10.1007/s10791-021-09395-3. Online publication date: 15-Jul-2021

      Published In

      WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
      February 2011
      870 pages
      ISBN:9781450304931
      DOI:10.1145/1935826

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. greedy algorithm
      2. set cover
      3. strong query

      Qualifiers

      • Poster

      Acceptance Rates

      WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;
      Overall Acceptance Rate 498 of 2,863 submissions, 17%
