Mining a search engine's corpus without a query pool

Authors:
Mingyang Zhang, George Washington University, Washington D.C., USA
Nan Zhang, George Washington University, Washington D.C., USA
Gautam Das, University of Texas at Arlington, Qatar Computing Research Institute, Arlington, USA

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, October 2013, Pages 29–38, https://doi.org/10.1145/2505515.2505748

Published: 27 October 2013

ABSTRACT

Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information over such a document corpus, but have no direct way of accessing it other than through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner, specifically by issuing a small number of search queries through the web interface.

Almost all existing techniques require a pre-constructed query pool, i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge of the corpus (e.g., its size, topics, and special terms), essentially leading to a chicken-and-egg problem. In this paper, we develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch.
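
To make the query-pool requirement concrete, the following is a minimal sketch of how the recall of a pool could be measured against a toy corpus. It is not QG-SAMPLER or QG-ESTIMATOR, whose details the abstract does not give; the corpus, the `search` function, the top-k limit, and both pools are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy corpus (document id -> text); not taken from the paper.
CORPUS: Dict[int, str] = {
    1: "flu symptoms and treatment",
    2: "heart disease risk factors",
    3: "flu vaccine schedule",
    4: "diabetes diet and exercise",
    5: "election results and analysis",
}


def search(query: str, k: int = 2) -> List[int]:
    """Toy keyword interface: return at most k ids of documents containing the query word."""
    hits = [doc_id for doc_id, text in sorted(CORPUS.items())
            if query in text.split()]
    return hits[:k]  # a restrictive interface reveals only the top-k matches


def pool_recall(pool: List[str], k: int = 2) -> float:
    """Fraction of the corpus returned by at least one query in the pool."""
    seen: Set[int] = set()
    for q in pool:
        seen.update(search(q, k))
    return len(seen) / len(CORPUS)


# A pool tailored to the corpus versus one built without knowledge of it.
matched_pool = ["flu", "heart", "diabetes", "election"]
mismatched_pool = ["football", "stocks", "flu"]

print(pool_recall(matched_pool))       # tailored pool
print(pool_recall(mismatched_pool))    # mismatched pool
print(pool_recall(matched_pool, k=1))  # same tailored pool, stricter top-k limit
```

In this toy setting the tailored pool reaches every document, the mismatched pool reaches only the two documents matching "flu", and tightening the top-k limit lowers the recall of even the tailored pool. Building the tailored pool required knowing the corpus vocabulary in advance, which is the chicken-and-egg problem the abstract describes.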

Index Terms

Mining a search engine's corpus without a query pool
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published in
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013
2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515
General Chairs: Qi He (LinkedIn, USA), Arun Iyengar (IBM T.J. Watson Research Center, USA)
Program Chairs: Wolfgang Nejdl (L3S Research Center, Germany), Jian Pei (Simon Fraser University, Canada), Rajeev Rastogi (Amazon, India)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
sampling
search engine
Qualifiers
- research-article
Acceptance Rates
CIKM '13 Paper Acceptance Rate: 143 of 848 submissions, 17%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%