poster

An algorithmic treatment of strong queries

Authors:

Silvio Lattanzi,

Prabhakar Raghavan

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

Pages 775 - 784

https://doi.org/10.1145/1935826.1935928

Published: 09 February 2011

Abstract

A strong query for a target document with respect to an index is the smallest query for which the target document is returned by the index as the top result for the query. The strong query problem was first studied more than a decade ago in the context of measuring search engine overlap. Despite its simple-to-state nature and its longevity in the field, this problem has not been sufficiently addressed in a formal manner.

In this paper we provide the first rigorous treatment of the strong query problem. We show an interesting connection between this problem and the set cover problem, and use it to obtain basic hardness and algorithmic results. Experiments on more than 10K documents show that our proposed algorithm performs much better than the widely used word frequency-based heuristic. En route, our study suggests that fewer than four words on average can be sufficient to uniquely identify web pages.
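The set cover connection mentioned in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a toy index with purely conjunctive (AND) matching and treats "top result" as "only matching document", which ignores ranking. Under those assumptions, each word of the target document "covers" the other documents that lack it, and a query identifies the target once every other document is covered. The function name `greedy_strong_query` and the sample data are hypothetical.

```python
def greedy_strong_query(target, corpus):
    """Greedy set-cover heuristic for a short identifying query.

    target: set of words in the target document.
    corpus: list of word sets for the OTHER documents in the toy index.
    Returns a list of words from `target` such that no document in `corpus`
    contains all of them, or None if no such query exists in this model.
    Assumption: conjunctive matching only; ranking is not modeled.
    """
    remaining = list(range(len(corpus)))  # documents that still match the query
    candidates = set(target)
    query = []
    while remaining:
        if not candidates:
            return None  # some other document contains every target word
        # Pick the word that eliminates the most still-matching documents.
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(
            sorted(candidates),
            key=lambda w: sum(1 for i in remaining if w not in corpus[i]),
        )
        if all(best in corpus[i] for i in remaining):
            return None  # no remaining word eliminates anything
        query.append(best)
        candidates.discard(best)
        remaining = [i for i in remaining if best in corpus[i]]
    return query


target = {"strong", "query", "set", "cover"}
corpus = [
    {"strong", "query", "index"},
    {"set", "cover", "greedy"},
    {"query", "set", "web"},
]
print(greedy_strong_query(target, corpus))  # ['cover', 'query']
```

In this toy example two words are enough: no other document contains both "cover" and "query". A frequency-based heuristic would instead rank the target's words by how rare they are in the corpus, without accounting for which documents each word has already excluded.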

Cited By

Penha G, Palumbo E, Aziz M, Wang A, Bouchard H (2023). Improving Content Retrievability in Search with Controllable Query Generation. Proceedings of the ACM Web Conference 2023, pages 3182-3192. DOI: 10.1145/3543507.3583261. Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543507.3583261

Liu B, Lu X, Culpepper J (2021). Strong natural language query generation. Information Retrieval Journal. DOI: 10.1007/s10791-021-09395-3. Online publication date: 15-Jul-2021.
https://doi.org/10.1007/s10791-021-09395-3

Index Terms

An algorithmic treatment of strong queries
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

February 2011

870 pages

ISBN:9781450304931

DOI:10.1145/1935826

General Chair: Irwin King (CUHK, Hong Kong)

Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

WSDM'11

WSDM'11: Fourth ACM International Conference on Web Search and Data Mining

February 9 - 12, 2011

Hong Kong, China

Acceptance Rates

WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics

Total Citations: 2
Total Downloads: 269
Downloads (last 12 months): 1
Downloads (last 6 weeks): 0

Reflects downloads up to 19 Feb 2025
