Sampling Search-Engine Results

Abstract

We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:

  • Determining the set of categories in a given taxonomy spanned by the search results;

  • Finding the range of metadata values associated with the result set in order to enable “multi-faceted search”;

  • Estimating the size of the result set;

  • Data mining associations to the query terms.

We present and analyze efficient algorithms for obtaining uniform random samples applicable to any search engine that is based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, for example, Google, Yahoo Search, MSN Search, Ask, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic sample-next(p) method that samples term posting lists with probability p, and show how to construct sample-next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
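
To make the stream interface concrete, the sketch below shows, in Python, a posting list exposed as a stream with a next method and an illustrative sample-next(p) built on top of it, plus a simple AND combiner. This is a minimal illustration under our own assumptions, not the paper's construction: the class names TermPostings and AndStream, the geometric-skip trick in the primitive sample_next, and the naive thinning in AndStream.sample_next are hypothetical choices made here for exposition, whereas the paper derives efficient sample-next(p) methods for AND, OR, and WAND directly from the primitive methods.

```python
import math
import random
from typing import List, Optional


class TermPostings:
    """Posting list of a single term: a sorted stream of document IDs
    supporting document-at-a-time evaluation through next(target)."""

    def __init__(self, doc_ids: List[int]):
        self.doc_ids = sorted(set(doc_ids))
        self.pos = 0

    def next(self, target: int = 0) -> Optional[int]:
        """Move the cursor to the first posting with doc ID >= target and
        return it; return None when the list is exhausted."""
        while self.pos < len(self.doc_ids) and self.doc_ids[self.pos] < target:
            self.pos += 1
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def sample_next(self, p: float, target: int = 0) -> Optional[int]:
        """Return the next posting >= target that survives independent
        Bernoulli(p) sampling.  A Geometric(p) number of postings is skipped
        in one step, so the expected work is proportional to the sample size
        rather than to the length of the posting list."""
        if p >= 1.0:
            return self.next(target)
        skip = int(math.log(1.0 - random.random()) / math.log(1.0 - p))
        doc = self.next(target)
        for _ in range(skip):
            if doc is None:
                return None
            doc = self.next(doc + 1)
        return doc


class AndStream:
    """Conjunction (AND) of two posting streams, evaluated document-at-a-time
    with the usual zig-zag join.  sample_next here simply thins the full
    intersection with probability p, which is correct but not efficient;
    it only illustrates the interface the paper builds on."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def next(self, target: int = 0) -> Optional[int]:
        doc = self.left.next(target)
        while doc is not None:
            other = self.right.next(doc)
            if other is None:
                return None
            if other == doc:
                return doc  # both streams contain this document
            doc = self.left.next(other)
        return None

    def sample_next(self, p: float, target: int = 0) -> Optional[int]:
        doc = self.next(target)
        while doc is not None and random.random() >= p:
            doc = self.next(doc + 1)
        return doc


# Usage: sample the results of the (hypothetical) query "foo AND bar" with p = 0.5.
foo = TermPostings([1, 3, 4, 7, 9, 12])
bar = TermPostings([3, 5, 7, 8, 12, 15])
query = AndStream(foo, bar)
sample, doc = [], query.sample_next(0.5)
while doc is not None:
    sample.append(doc)
    doc = query.sample_next(0.5, doc + 1)
print(sample)  # a random subset of the matching documents {3, 7, 12}
```

In the paper's setting the same sample-next(p) signature would be implemented directly for OR and WAND as well, so that the sampling cost stays proportional to the sample size even for complex Boolean queries.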

References

  1. Aggarwal, C.C., Gates, S.C., Yu, P.S.: On using partial supervision for text categorization. IEEE Trans. Knowl. Data Eng. 16(2), 245–255 (2004)

  2. Amitay, E., Carmel, D., Lempel, R., Soffer, A.: Scaling IR-system evaluation using term relevance sets. In: Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, pp. 10–17. ACM (2004)

  3. Anagnostopoulos, A., Broder, A.Z., Carmel, D.: Sampling search-engine results. In: WWW ’05: Proceedings of the 14th International Conference on World Wide Web, pp. 245–256. ACM (2005)

  4. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634. Society for Industrial and Applied Mathematics (2002)

  5. Bagrow, J.P., ben-Avraham, D.: On the Google-fame of scientists and other populations. In: Proceedings of the 8th Granada Seminar on Computational and Statistical Physics, Modeling Cooperative Behavior in the Social Sciences, pp. 81–89 (2005)

  6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. WWW7/Computer Networks and ISDN Systems 30, 107–117 (1998)

  7. Broder, A.Z.: A taxonomy of Web search. SIGIR Forum 36(2), 3–10 (2002)

  8. Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 426–434. ACM (2003)

  9. Burrows, M.: Sequential searching of a database index using constraints on word-location pairs. US Patent 5,745,890 (1998)

  10. Carmel, D., Amitay, E., Herscovici, M., Maarek, Y.S., Petruschka, Y., Soffer, A.: Juru at TREC 10—Experiments with Index Pruning. In: Proceedings of the Tenth Text REtrieval Conference (TREC-10). National Institute of Standards and Technology (NIST) (2001)

  11. Devroye, L.: Non-Uniform Random Variate Generation. Springer, Berlin Heidelberg New York (1986)

  12. Fallows, D., Rainie, L., Mudd, G.: The popularity and importance of search engines. The Pew Internet & American Life Project (August 2004). http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf

  13. Fontoura, M., Shekita, E.J., Zien, J.Y., Rajagopalan, S., Neumann, A.: High performance index build algorithms for intranet search engines. In: VLDB 2004, Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 1158–1169. Morgan Kaufmann (2004)

  14. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: SPAA ’01: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 281–291. ACM (2001)

  15. Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P., Tomkins, A., Zien, J.: How to build a WebFountain: an architecture for very large-scale text analytics. IBM Syst. J. 43(1) (2004)

  16. Gulli, A., Signorini, A.: The indexable Web is more than 11.5 billion pages. In: WWW ’05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pp. 902–903. ACM (2005)

  17. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for join selectivity estimation. In: Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 14–24. ACM (1994)

  18. Li, K.-H.: Reservoir-sampling algorithms of time complexity \(O(n(1 + \log(N/n)))\). ACM Trans. Math. Softw. 20(4), 481–493 (1994)

  19. Muthukrishnan, S.: Data streams: algorithms and applications. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA-03), p. 413. ACM (2003)

  20. Radev, D.R., Qi, H., Zheng, Z., Blair-Goldensohn, S., Zhang, Z., Fan, W., Prager, J.: Mining the Web for answers to natural language questions. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 143–150. ACM (2001)

  21. Silverstein, C., Henzinger, M., Marais, H., Moricz, M.: Analysis of a very large Web search engine query log. SIGIR Forum 33(1), 6–12 (1999)

  22. Turtle, H., Flood, J.: Query evaluation: strategies and optimizations. Inf. Process. Manag. 31(6), 831–850 (1995)

  23. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

  24. Williams, D.: Probability with Martingales. Cambridge University Press (1991)

  25. Yee, K.-P., Swearingen, K., Li, K., Hearst, M.: Faceted metadata for image search and browsing. In: Proceedings of the Conference on Human Factors in Computing Systems, pp. 401–408. ACM (2003)

Author information

Corresponding author

Correspondence to Aris Anagnostopoulos.

Additional information

A preliminary version of this work has appeared in [3].

Work performed while A. Anagnostopoulos and A.Z. Broder were at IBM T. J. Watson Research Center.

Cite this article

Anagnostopoulos, A., Broder, A.Z. & Carmel, D. Sampling Search-Engine Results. World Wide Web 9, 397–429 (2006). https://doi.org/10.1007/s11280-006-0222-z
