ABSTRACT
Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally makes OLAP (Online Analytical Processing) over sampling data desirable. However, unlike traditional data, sampling data is inherently uncertain: it does not represent the full data in the population. Thus, it is desirable to return not only query results but also confidence intervals indicating the reliability of those results. Moreover, a given segment of the multidimensional space may contain no samples or too few, in which case additional analysis is required to return trustworthy results.
In this paper we propose a Sampling Cube framework, which efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase the sample size when needed. Further, to handle high-dimensional data, a Sampling Cube Shell method is proposed to reduce the storage requirement while still preserving query result quality.
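The two ideas named above, a confidence interval per query and enlarging a sparse segment by grouping it with related ones, can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's method: the names `mean_confidence_interval`, `query_cell`, and `min_samples` are hypothetical, the interval uses a normal approximation, and "grouping similar segments" is reduced to rolling up the most recently listed dimension until enough samples match.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, low, high) for the sample mean.

    Assumption: a normal approximation is used for simplicity; it is
    only reasonable when the number of samples is not too small.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a variance")
    m = mean(values)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(values) / sqrt(n)
    return m, m - half_width, m + half_width


def query_cell(samples, cell, measure, min_samples=5):
    """Answer a point query on `cell` (a dict of dimension -> value).

    Hypothetical expansion rule: while fewer than `min_samples` rows
    match, drop the last-listed dimension constraint (a roll-up) and
    retry. Returns the constraints actually used and the interval.
    """
    constraints = dict(cell)
    while True:
        matched = [
            row[measure]
            for row in samples
            if all(row[dim] == val for dim, val in constraints.items())
        ]
        if len(matched) >= min_samples or not constraints:
            break
        constraints.popitem()  # removes the most recently inserted dimension
    return constraints, mean_confidence_interval(matched)
```

Calling `query_cell(samples, {"age": "young", "region": "east"}, "rating")` on a list of row dictionaries returns both the interval and the constraints that were finally applied, so a caller can tell whether the answer refers to the requested segment or to a coarser one. Reporting the expanded scope alongside the interval is a design choice of this sketch.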