ABSTRACT
Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information about such a corpus, but have no way to access it other than through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner, specifically by issuing a small number of search queries through the web interface.
Almost all existing techniques require a pre-constructed query pool, i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge of the corpus (e.g., its size, topic, and special terms), essentially leading to a chicken-and-egg problem. In this paper, we develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch.
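To make the link between sampling and aggregate estimation concrete, the following is a minimal sketch of the textbook Hansen-Hurwitz estimator, which turns a biased sample with known per-draw selection probabilities into COUNT, SUM, and AVG estimates. It is not QG-SAMPLER or QG-ESTIMATOR, whose details the abstract does not give. The corpus is simulated locally (no search interface is queried), and the names `hansen_hurwitz` and `demo`, the document lengths, and the sampling weights are all assumptions made for illustration.

```python
import random


def hansen_hurwitz(samples):
    """Estimate COUNT, SUM and AVG from draws made with replacement.

    samples: list of (value, prob) pairs, where prob is the known
    per-draw probability of selecting that document.
    """
    n = len(samples)
    # Each draw contributes 1/p to COUNT and value/p to SUM.
    count_est = sum(1.0 / p for _, p in samples) / n
    sum_est = sum(v / p for v, p in samples) / n
    avg_est = sum_est / count_est  # ratio estimator for AVG
    return count_est, sum_est, avg_est


def demo(seed=0, corpus_size=1000, draws=20000):
    """Simulate a biased sampler over a hypothetical corpus."""
    rng = random.Random(seed)
    # Hypothetical attribute: a document "length" between 100 and 160.
    values = [100 + (i % 7) * 10 for i in range(corpus_size)]
    # Assumed bias: some documents are up to 4x more likely to be drawn.
    weights = [1 + (i % 4) for i in range(corpus_size)]
    total_w = sum(weights)
    probs = [w / total_w for w in weights]
    idx = rng.choices(range(corpus_size), weights=weights, k=draws)
    samples = [(values[i], probs[i]) for i in idx]
    truth = (corpus_size, sum(values), sum(values) / corpus_size)
    return hansen_hurwitz(samples), truth


if __name__ == "__main__":
    estimates, truth = demo()
    print("estimated COUNT, SUM, AVG:", estimates)
    print("true      COUNT, SUM, AVG:", truth)
```

The sketch assumes the selection probability of each sampled document is known exactly. Over a real search interface those probabilities have to be derived or estimated from the sampling procedure itself.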
Index Terms
- Mining a search engine's corpus without a query pool