DOI: 10.1145/1739041.1739051
research-article

Turbo-charging hidden database samplers with overflowing queries and skew reduction

Published: 22 March 2010

ABSTRACT

Recently, there has been growing interest in random sampling from online hidden databases. These databases reside behind form-like web interfaces which allow users to execute search queries by specifying the desired values for certain attributes, and the system responds by returning a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable scoring function. In this paper, we consider the problem of uniform random sampling over such hidden databases. A key challenge is to eliminate the skew of samples incurred by the selective return of highly ranked tuples. To address this challenge, all state-of-the-art samplers share a common approach: they do not use overflowing queries. This is done in order to avoid favoring highly ranked tuples and thus incurring high skew in the retrieved samples. However, not considering overflowing queries substantially impacts sampling efficiency.
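The interface model described above can be illustrated with a small sketch. This is not the paper's sampler: the table contents, the `search` function, the limit `K`, and the status labels `"valid"` and `"underflow"` are all assumptions made for illustration (the abstract itself only names overflowing queries). It shows why reading samples directly from an overflowing query favors highly ranked tuples: anything ranked below the top k is never returned.

```python
# Hypothetical top-k form interface over a tiny hidden table (illustration only).
K = 2  # assumed top-k limit imposed by the interface

# Assumed rows: (id, make, color, score); a higher score ranks first.
TABLE = [
    (1, "ford", "red", 0.9),
    (2, "ford", "red", 0.7),
    (3, "ford", "red", 0.4),
    (4, "ford", "blue", 0.8),
    (5, "bmw", "red", 0.6),
]
COLUMNS = {"make": 1, "color": 2}


def search(**conditions):
    """Return (status, rows): at most K matching rows, best score first."""
    matches = [
        row for row in TABLE
        if all(row[COLUMNS[attr]] == value for attr, value in conditions.items())
    ]
    matches.sort(key=lambda row: row[3], reverse=True)
    if not matches:
        status = "underflow"   # nothing satisfies the conditions
    elif len(matches) > K:
        status = "overflow"    # more than K matches; only the top K are shown
    else:
        status = "valid"       # every match is returned
    return status, matches[:K]


status, rows = search(make="ford", color="red")
visible_ids = [row[0] for row in rows]
# Three tuples match, but the interface only ever reveals the two best-ranked ones.
print(status, visible_ids)
```

In this toy table, tuple 3 matches the query but can never be drawn from its result, which is the kind of skew the abstract attributes to the selective return of highly ranked tuples.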

In this paper, we propose novel sampling techniques which do leverage overflowing queries. As a result, we are able to significantly improve sampling efficiency over the state-of-the-art samplers, while at the same time substantially reducing the skew of generated samples. We conduct extensive experiments over synthetic and real-world databases to illustrate the superiority of our techniques over the existing ones.


  • Published in

    EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology
    March 2010
    741 pages
    ISBN: 9781605589459
    DOI: 10.1145/1739041

    Copyright © 2010 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    Overall Acceptance Rate: 7 of 10 submissions, 70%
