DOI: 10.1145/2505515.2505748
research-article

Mining a search engine's corpus without a query pool

Published: 27 October 2013

ABSTRACT

Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information over such a document corpus, but have no way of accessing it other than through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner, specifically by issuing a small number of search queries through the web interface.

Almost all existing techniques require a pre-constructed query pool, i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge of the corpus (e.g., its size, topics, and special terms), essentially leading to a chicken-and-egg problem. In this paper, we develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch.
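To make the notion of a query pool's recall concrete, the following is a minimal sketch on a toy in-memory corpus. It is not QG-SAMPLER or QG-ESTIMATOR; the corpus, the `search` function, and the two pools are hypothetical stand-ins chosen only to show how a pool that matches the corpus can recall every document while a mismatched pool recalls far fewer.

```python
from typing import Dict, Iterable, Set


def search(corpus: Dict[int, str], query: str) -> Set[int]:
    """Toy keyword search (an assumption): ids of documents containing the query word."""
    q = query.lower()
    return {doc_id for doc_id, text in corpus.items() if q in text.lower().split()}


def pool_recall(corpus: Dict[int, str], pool: Iterable[str]) -> float:
    """Fraction of the corpus returned by at least one query in the pool."""
    if not corpus:
        return 0.0
    recalled: Set[int] = set()
    for query in pool:
        recalled |= search(corpus, query)
    return len(recalled) / len(corpus)


# Hypothetical four-document corpus on health topics.
corpus = {
    1: "flu symptoms and treatment",
    2: "diabetes diet advice",
    3: "allergy treatment options",
    4: "flu vaccine schedule",
}

# A pool built with knowledge of the corpus versus one built for another domain.
matched_pool = ["flu", "treatment", "diabetes"]
mismatched_pool = ["election", "stocks", "flu"]

print(pool_recall(corpus, matched_pool))     # 1.0
print(pool_recall(corpus, mismatched_pool))  # 0.5
```

Computing recall this way requires the full corpus, which a third party does not have. That is the chicken-and-egg problem the abstract describes.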

Published in

CIKM '13: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
October 2013
2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

CIKM '13 paper acceptance rate: 143 of 848 submissions (17%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)
