research-article

Sampling cube: a framework for statistical olap over sampling data

Authors:
Xiaolei Li

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Jiawei Han

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Zhijun Yin

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Jae-Gil Lee

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Yizhou Sun

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataJune 2008Pages 779–790https://doi.org/10.1145/1376616.1376695

Published:09 June 2008Publication History

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Pages 779–790

ABSTRACT

Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally leads to the desirability of OLAP (Online Analytical Processing) over sampling data. However, unlike traditional data, sampling data is inherently uncertain, i.e., not representing the full data in the population. Thus, it is desirable to return not only query results but also the confidence intervals indicating the reliability of the results. Moreover, a certain segment in a multidimensional space may contain none or too few samples. This requires some additional analysis to return trustable results.

In this paper we propose a Sampling Cube framework, which efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase sampling size when needed. Further, to handle high dimensional data, a Sampling Cube Shell method is proposed to effectively reduce the storage requirement while still preserving query result quality.

References

C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In SIGMOD?01. Google ScholarDigital Library
O. Benjelloun, A. D. Sarma, A. Y. Halevy, and J. Widom. Uldbs: Databases with uncertainty and lineage. In VLDB?06. Google ScholarDigital Library
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In SIGMOD?99. Google ScholarDigital Library
D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. Olap over uncertain and imprecise data. In VLDB?05. Google ScholarDigital Library
D. Burdick, A. Doan, R. Ramakrishnan, and S. Vaithyanathan. Olap over imprecise data with domain constraints. In VLDB?07. Google ScholarDigital Library
Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD?03. Google ScholarDigital Library
Bee-Chung Chen, Lei Chen, Yi Lin, and Raghu Ramakrishnan. Prediction cubes. In VLDB?05. Google ScholarDigital Library
P.-A. Chirita, C. S. Firan, and W. Nejdl. Personalized query expansion for the web. In SIGIR?07. Google ScholarDigital Library
H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma. Query expansion by mining user logs. IEEE TKDE?03. Google ScholarDigital Library
J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational operator generalizing group-by, cross-tab and sub-totals. In ICDE?96. Google ScholarDigital Library
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. In Journal of Machine Learning Research, 2003. Google ScholarDigital Library
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In SIGMOD?96. Google ScholarDigital Library
W. L. Hays. Statistics. CBS College Publishing, New York, NY, 1981.Google Scholar
L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient cube: How to summarize the semantics of a data cube. In VLDB?02. Google ScholarDigital Library
L. V. S. Lakshmanan, J. Pei, and Y. Zhao. QC-Trees: An efficient summary structure for semantic OLAP. In SIGMOD?03. Google ScholarDigital Library
X. Li, J. Han, and H. Gonzalez. High-dimensional OLAP: A minimal cubing approach. In VLDB?04. Google ScholarDigital Library
T. M. Mitchell. Machine Learning. McGraw Hill, 1997. Google ScholarDigital Library
L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 2004. Google ScholarDigital Library
V. Raman and J. M. Hellerstein. Potter?s wheel: An interactive data cleaning system. In VLDB?01. Google ScholarDigital Library
Y. Sismanis and N. Roussopoulos. The complexity of fully materialized coalesced cubes. In VLDB?04. Google ScholarDigital Library
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann, 2005. Google ScholarDigital Library
D. Xin, J. Han, X. Li, and B. W. Wah. Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. In VLDB?03. Google ScholarDigital Library

Index Terms

Sampling cube: a framework for statistical olap over sampling data
1. Information systems
  1. Information systems applications

Recommendations

Graph cube: on warehousing and OLAP multidimensional networks
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

We consider extending decision support facilities toward large sophisticated networks, upon which multidimensional attributes are associated with network entities, thereby forming the so-called multidimensional networks. Data warehouses and OLAP (Online ...
Read More
AND/OR importance sampling
UAI'08: Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence

The paper introduces AND/OR importance sampling for probabilistic graphical models. In contrast to importance sampling, AND/OR importance sampling caches samples in the AND/OR space and then extracts a new sample mean from the stored samples. We prove ...
Read More
Semi-closed cube: an effective approach to trading off data cube size and query response time

The results of data cube will occupy huge amount of disk space when the base table is of a large number of attributes. A new type of data cube, compact data cube like condensed cube and quotient cube, was proposed to solve the problem. It compresses ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
June 2008
1396 pages
ISBN:9781605581026
DOI:10.1145/1376616
General Chairs:
Laks V. S. Lakshmanan
University of British Columbia, Canada
,
Raymond T. Ng
University of British Columbia, Canada
,
Dennis Shasha
New York University, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 June 2008
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
olap sampling
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 747
  Total Downloads
- Downloads (Last 12 months)18
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Sampling cube: a framework for statistical olap over sampling data

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Graph cube: on warehousing and OLAP multidimensional networks

AND/OR importance sampling

Semi-closed cube: an effective approach to trading off data cube size and query response time