DOI: 10.1145/1376616.1376695
research-article

Sampling Cube: A Framework for Statistical OLAP over Sampling Data

Published: 09 June 2008

ABSTRACT

Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally motivates OLAP (Online Analytical Processing) over sampling data. However, unlike traditional data, sampling data is inherently uncertain: it does not represent the full data in the population. It is therefore desirable to return not only query results but also confidence intervals that indicate the reliability of those results. Moreover, a given segment of the multidimensional space may contain no samples or too few, so additional analysis is needed to return trustworthy results.

In this paper, we propose a Sampling Cube framework that efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments, increasing the sample size when needed. Further, to handle high-dimensional data, we propose a Sampling Cube Shell method that reduces the storage requirement while preserving query result quality.
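As a rough illustration of the two ideas in the abstract (a confidence interval attached to a query result, and widening a sparse segment to gather more samples), the following Python sketch may help. It is not the paper's algorithm. The normal approximation, the `min_samples` threshold, the "drop the last dimension" roll-up policy, and the names `confidence_interval`, `query_cell`, and `demo_samples` are all assumptions made for this example. In particular, the sketch does not check whether the segments it merges are similar.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(values, confidence=0.95):
    """Return (mean, half_width) for the sample mean.

    Assumption: a normal approximation is used for the critical value;
    a real system might prefer a t-distribution for small samples.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a standard error")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    std_err = stdev(values) / math.sqrt(n)
    return mean(values), z * std_err


def query_cell(samples, cell, measure, min_samples=3, confidence=0.95):
    """Answer a point query on `cell` (dimension -> value) over `samples`.

    Hypothetical roll-up policy: while the cell holds fewer than
    `min_samples` rows, drop the most recently listed dimension
    constraint, which generalizes the query to a coarser segment.
    Returns (constraints_used, sample_count, mean, half_width).
    """
    constraints = dict(cell)
    while True:
        values = [
            row[measure]
            for row in samples
            if all(row[dim] == val for dim, val in constraints.items())
        ]
        if len(values) >= min_samples or not constraints:
            break
        constraints.popitem()  # removes the last-inserted constraint
    avg, half_width = confidence_interval(values, confidence)
    return constraints, len(values), avg, half_width


# Made-up sample rows for illustration only.
demo_samples = [
    {"age": "young", "region": "east", "rating": 4.0},
    {"age": "young", "region": "west", "rating": 6.0},
    {"age": "young", "region": "west", "rating": 8.0},
    {"age": "old", "region": "east", "rating": 2.0},
    {"age": "old", "region": "west", "rating": 3.0},
]

used, count, avg, half_width = query_cell(
    demo_samples, {"age": "young", "region": "east"}, "rating"
)
print(used, count, avg, half_width)
```

Here the cell (young, east) holds a single sample, so the sketch falls back to the coarser segment (young) before reporting a mean and interval. Returning the constraints actually used lets the caller see that the answer comes from a generalized segment rather than the one originally requested.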

Published in

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
June 2008
1396 pages
ISBN: 9781605581026
DOI: 10.1145/1376616

      Copyright © 2008 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
