Sampling Search-Engine Results

Abstract

We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:

  • Determining the set of categories in a given taxonomy spanned by the search results;

  • Finding the range of metadata values associated with the result set in order to enable “multi-faceted search”;

  • Estimating the size of the result set;

  • Data mining associations to the query terms.

We present and analyze efficient algorithms for obtaining uniform random samples applicable to any search engine that is based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, for example, Google, Yahoo Search, MSN Search, Ask, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic sample-next(p) method that samples term posting lists with probability p, and show how to construct sample-next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
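
To make the stream interface concrete, the sketch below shows, in Python, a posting list exposed as a stream with a next method and an illustrative sample-next(p) built on top of it, plus a simple AND combiner. This is a minimal illustration under our own assumptions, not the paper's construction: the class names TermPostings and AndStream, the geometric-skip trick in the primitive sample_next, and the naive thinning in AndStream.sample_next are hypothetical choices made here for exposition, whereas the paper derives efficient sample-next(p) methods for AND, OR, and WAND directly from the primitive methods.

```python
import math
import random
from typing import List, Optional


class TermPostings:
    """Posting list of a single term: a sorted stream of document IDs
    supporting document-at-a-time evaluation through next(target)."""

    def __init__(self, doc_ids: List[int]):
        self.doc_ids = sorted(set(doc_ids))
        self.pos = 0

    def next(self, target: int = 0) -> Optional[int]:
        """Move the cursor to the first posting with doc ID >= target and
        return it; return None when the list is exhausted."""
        while self.pos < len(self.doc_ids) and self.doc_ids[self.pos] < target:
            self.pos += 1
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def sample_next(self, p: float, target: int = 0) -> Optional[int]:
        """Return the next posting >= target that survives independent
        Bernoulli(p) sampling.  A Geometric(p) number of postings is skipped
        in one step, so the expected work is proportional to the sample size
        rather than to the length of the posting list."""
        if p >= 1.0:
            return self.next(target)
        skip = int(math.log(1.0 - random.random()) / math.log(1.0 - p))
        doc = self.next(target)
        for _ in range(skip):
            if doc is None:
                return None
            doc = self.next(doc + 1)
        return doc


class AndStream:
    """Conjunction (AND) of two posting streams, evaluated document-at-a-time
    with the usual zig-zag join.  sample_next here simply thins the full
    intersection with probability p, which is correct but not efficient;
    it only illustrates the interface the paper builds on."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def next(self, target: int = 0) -> Optional[int]:
        doc = self.left.next(target)
        while doc is not None:
            other = self.right.next(doc)
            if other is None:
                return None
            if other == doc:
                return doc  # both streams contain this document
            doc = self.left.next(other)
        return None

    def sample_next(self, p: float, target: int = 0) -> Optional[int]:
        doc = self.next(target)
        while doc is not None and random.random() >= p:
            doc = self.next(doc + 1)
        return doc


# Usage: sample the results of the (hypothetical) query "foo AND bar" with p = 0.5.
foo = TermPostings([1, 3, 4, 7, 9, 12])
bar = TermPostings([3, 5, 7, 8, 12, 15])
query = AndStream(foo, bar)
sample, doc = [], query.sample_next(0.5)
while doc is not None:
    sample.append(doc)
    doc = query.sample_next(0.5, doc + 1)
print(sample)  # a random subset of the matching documents {3, 7, 12}
```

In the paper's setting the same sample-next(p) signature would be implemented directly for OR and WAND as well, so that the sampling cost stays proportional to the sample size even for complex Boolean queries.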

References

  1. Aggarwal, C.C., Gates, S.C., Yu, P.S.: On using partial supervision for text categorization. IEEE Trans. Knowl. Data Eng. 16(2), 245–255 (2004)

  2. Amitay, E., Carmel, D., Lempel, R., Soffer, A.: Scaling IR-system evaluation using term relevance sets. In: Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, pp. 10–17. ACM (2004)

  3. Anagnostopoulos, A., Broder, A.Z., Carmel, D.: Sampling search-engine results. In: WWW ’05: Proceedings of the 14th International Conference on World Wide Web, pp. 245–256. ACM (2005)

  4. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634. Society for Industrial and Applied Mathematics (2002)

  5. Bagrow, J.P., ben-Avraham, D.: On the Google-fame of scientists and other populations. In: Proceedings of the 8th Granada Seminar on Computational and Statistical Physics, Modeling Cooperative Behavior in the Social Sciences, pp. 81–89 (2005)

  6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. WWW7/Computer Networks and ISDN Systems 30, 107–117 (1998)

  7. Broder, A.Z.: A taxonomy of Web search. SIGIR Forum 36(2), 3–10 (2002)

  8. Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 426–434. ACM (2003)

  9. Burrows, M.: Sequential searching of a database index using constraints on word-location pairs. US Patent 5,745,890 (1998)

  10. Carmel, D., Amitay, E., Herscovici, M., Maarek, Y.S., Petruschka, Y., Soffer, A.: Juru at TREC 10—Experiments with Index Pruning. In: Proceedings of the Tenth Text REtrieval Conference (TREC-10). National Institute of Standards and Technology (NIST) (2001)

  11. Devroye, L.: Non-Uniform Random Variate Generation. Springer, Berlin Heidelberg New York (1986)

  12. Fallows, D., Rainie, L., Mudd, G.: The popularity and importance of search engines. The Pew Internet & American Life Project (August 2004). http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf

  13. Fontoura, M., Shekita, E.J., Zien, J.Y., Rajagopalan, S., Neumann, A.: High performance index build algorithms for intranet search engines. In: VLDB 2004, Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 1158–1169. Morgan Kaufmann (2004)

  14. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: SPAA ’01: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 281–291. ACM (2001)

  15. Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P., Tomkins, A., Zien, J.: How to build a WebFountain: an architecture for very large-scale text analytics. IBM Syst. J. 43(1) (2004)

  16. Gulli, A., Signorini, A.: The indexable Web is more than 11.5 billion pages. In: WWW ’05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pp. 902–903. ACM (2005)

  17. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for join selectivity estimation. In: Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 14–24. ACM (1994)

  18. Li, K.-H.: Reservoir-sampling algorithms of time complexity \(O(n(1 + \log(N/n)))\). ACM Trans. Math. Softw. 20(4), 481–493 (1994)

  19. Muthukrishnan, S.: Data streams: algorithms and applications. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA-03), p. 413. ACM (2003)

  20. Radev, D.R., Qi, H., Zheng, Z., Blair-Goldensohn, S., Zhang, Z., Fan, W., Prager, J.: Mining the Web for answers to natural language questions. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 143–150. ACM (2001)

  21. Silverstein, C., Henzinger, M., Marais, H., Moricz, M.: Analysis of a very large Web search engine query log. SIGIR Forum 33(1), 6–12 (1999)

  22. Turtle, H., Flood, J.: Query evaluation: strategies and optimizations. Inf. Process. Manag. 31(6), 831–850 (1995)

  23. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

  24. Williams, D.: Probability with Martingales. Cambridge University Press (1991)

  25. Yee, K.-P., Swearingen, K., Li, K., Hearst, M.: Faceted metadata for image search and browsing. In: Proceedings of the Conference on Human Factors in Computing Systems, pp. 401–408. ACM (2003)

Author information

Corresponding author

Correspondence to Aris Anagnostopoulos.

Additional information

A preliminary version of this work has appeared in [3].

Work performed while A. Anagnostopoulos and A.Z. Broder were at IBM T. J. Watson Research Center.

Cite this article

Anagnostopoulos, A., Broder, A.Z. & Carmel, D. Sampling Search-Engine Results. World Wide Web 9, 397–429 (2006). https://doi.org/10.1007/s11280-006-0222-z
