ABSTRACT
Random sampling from online hidden databases has attracted growing interest. These databases sit behind form-like web interfaces: a user issues a search query by specifying desired values for certain attributes, and the system returns only a few (e.g., top-k) of the tuples that satisfy the selection conditions, ranked by a scoring function. In this paper, we consider the problem of uniform random sampling over such hidden databases. A key challenge is to eliminate the sample skew caused by the selective return of highly ranked tuples. To address this challenge, all state-of-the-art samplers share a common approach: they do not use overflowing queries, i.e., queries that match more tuples than the interface returns. This avoids favoring highly ranked tuples, which would skew the retrieved samples. However, ignoring overflowing queries substantially reduces sampling efficiency.
We propose novel sampling techniques that do leverage overflowing queries. As a result, we significantly improve sampling efficiency over state-of-the-art samplers while also substantially reducing the skew of the generated samples. Extensive experiments over synthetic and real-world databases illustrate the superiority of our techniques over existing ones.
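As an illustration of the interface model described in this abstract, the following Python sketch simulates a toy top-k form interface. Everything in it is an assumption made for illustration: the class `HiddenDatabase`, the `QueryResult` type, the attribute names `a` and `b`, the `score` column, and the status labels "underflow", "valid", and "overflow". It is not the sampling technique proposed in the paper; it only shows how a query that matches more than k tuples exposes just the highest-ranked ones, which is the source of the skew discussed above.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, int]


@dataclass
class QueryResult:
    status: str       # "underflow", "valid", or "overflow" (illustrative labels)
    rows: List[Row]   # at most k rows, highest score first


class HiddenDatabase:
    """Toy stand-in for a form-based top-k interface (hypothetical)."""

    def __init__(self, rows: List[Row], k: int, score_key: str = "score"):
        self._rows = list(rows)
        self._k = k
        self._score_key = score_key

    def query(self, **conditions: int) -> QueryResult:
        # Select the tuples that satisfy every attribute = value condition.
        matches = [
            r for r in self._rows
            if all(r[attr] == value for attr, value in conditions.items())
        ]
        # Rank by the scoring function, best first.
        matches.sort(key=lambda r: r[self._score_key], reverse=True)
        if not matches:
            return QueryResult("underflow", [])
        if len(matches) > self._k:
            # Overflow: only the top-k tuples are visible to the user.
            return QueryResult("overflow", matches[: self._k])
        return QueryResult("valid", matches)


# Made-up data: five tuples, two searchable attributes, one score column.
ROWS: List[Row] = [
    {"id": 1, "a": 0, "b": 0, "score": 9},
    {"id": 2, "a": 0, "b": 1, "score": 7},
    {"id": 3, "a": 0, "b": 1, "score": 5},
    {"id": 4, "a": 0, "b": 1, "score": 1},
    {"id": 5, "a": 1, "b": 0, "score": 3},
]

if __name__ == "__main__":
    db = HiddenDatabase(ROWS, k=2)
    for conditions in ({"a": 0}, {"a": 1}, {"a": 1, "b": 1}):
        result = db.query(**conditions)
        print(conditions, result.status, [r["id"] for r in result.rows])
```

In this toy setup the query `a=0` matches four tuples but returns only the two with the highest scores. A sampler that accepted such results naively would over-represent highly ranked tuples, which is why the baseline approach described above restricts itself to non-overflowing queries.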