research-article

Turbo-charging hidden database samplers with overflowing queries and skew reduction

Authors:
Arjun Dasgupta, University of Texas at Arlington
Nan Zhang, George Washington University
Gautam Das, University of Texas at Arlington

EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology, March 2010, Pages 51–62
https://doi.org/10.1145/1739041.1739051

Published: 22 March 2010

ABSTRACT

Recently, there has been growing interest in random sampling from online hidden databases. These databases reside behind form-like web interfaces that allow users to execute search queries by specifying desired values for certain attributes. The system responds by returning only a few (e.g., top-k) of the tuples that satisfy the selection conditions, sorted by a suitable scoring function. In this paper, we consider the problem of uniform random sampling over such hidden databases. A key challenge is to eliminate the skew that the selective return of highly ranked tuples introduces into the samples. To address this challenge, all state-of-the-art samplers share a common approach: they do not use overflowing queries, so as to avoid favoring highly ranked tuples and thus incurring high skew in the retrieved samples. However, ignoring overflowing queries substantially reduces sampling efficiency.

In this paper, we propose novel sampling techniques that do leverage overflowing queries. As a result, we significantly improve sampling efficiency over state-of-the-art samplers while also substantially reducing the skew of the generated samples. We conduct extensive experiments over synthetic and real-world databases to illustrate the superiority of our techniques over existing ones.
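The interface the abstract describes can be illustrated with a minimal Python sketch. It is not the paper's sampler. The toy Boolean database, the static score, and the names `make_db`, `top_k_query`, `classify`, and `naive_sample` are all hypothetical. The labels "valid" and "underflow" are assumed terminology for queries that return all of their matches or none at all; the abstract itself names only overflowing queries, taken here to mean queries with more than k matching tuples.

```python
import random
from itertools import product


def make_db(n_attrs=4, n_tuples=12, seed=0):
    """Build a toy hidden database of distinct Boolean tuples (hypothetical)."""
    rng = random.Random(seed)
    rows = rng.sample(list(product((0, 1), repeat=n_attrs)), n_tuples)
    # The random score stands in for the site's ranking function (assumption).
    return [(row, rng.random()) for row in rows]


def top_k_query(db, predicate, k):
    """Simulate the form interface.

    predicate maps attribute index -> required value. Returns (rows, overflow):
    at most k matching rows, highest score first, plus a flag that is True
    when more than k rows matched and the rest were withheld.
    """
    matches = [(row, score) for row, score in db
               if all(row[a] == v for a, v in predicate.items())]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [row for row, _ in matches[:k]], len(matches) > k


def classify(db, predicate, k):
    """Label a query as 'overflow', 'valid', or 'underflow' (assumed terms)."""
    rows, overflow = top_k_query(db, predicate, k)
    if overflow:
        return "overflow"
    return "valid" if rows else "underflow"


def naive_sample(db, k, rng):
    """Naive sampler: random query, then a random returned row.

    It accepts overflowing queries as they come, so rows with high scores
    are returned far more often than low-scoring ones. This is the skew
    the abstract refers to.
    """
    n_attrs = len(db[0][0])
    while True:
        attrs = rng.sample(range(n_attrs), rng.randint(0, n_attrs))
        predicate = {a: rng.randint(0, 1) for a in attrs}
        rows, _ = top_k_query(db, predicate, k)
        if rows:
            return rng.choice(rows)


if __name__ == "__main__":
    from collections import Counter

    db = make_db()
    print(classify(db, {}, 3))  # the empty query matches every row
    counts = Counter(naive_sample(db, 3, random.Random(i)) for i in range(2000))
    for row, score in sorted(db, key=lambda pair: pair[1], reverse=True):
        print(row, round(score, 2), counts[row])
```

In this toy setting the empty query always overflows and only ever exposes the k best-scored rows, so a sampler that draws from such results cannot be uniform without some correction. Discarding overflowing queries avoids that bias, but every discarded query is wasted, which is the efficiency cost the abstract points to.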

Published in
EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology
March 2010, 741 pages
ISBN: 9781605589459
DOI: 10.1145/1739041
Editors:
Ioana Manolescu, INRIA, France
Stefano Spaccapietra, EPFL, Switzerland
Jens Teubner, ETH Zurich, Switzerland
Masaru Kitsuregawa, Tokyo University, Japan
Alain Leger, Orange - France Telecom R&D, France
Felix Naumann, Hasso Plattner Institute, Germany
Anastasia Ailamaki, EPFL, Switzerland
Fatma Ozcan, IBM Almaden Research Center
Copyright © 2010 ACM
Publisher: Association for Computing Machinery, New York, NY, United States