research-article

Extracting Top-K Insights from Multi-dimensional Data

Authors:
Bo Tang (The Hong Kong Polytechnic University & Microsoft Research, Hong Kong)
Shi Han (Microsoft Research, Beijing, China)
Man Lung Yiu (The Hong Kong Polytechnic University, Hong Kong)
Rui Ding (Microsoft Research, Beijing, China)
Dongmei Zhang (Microsoft Research, Beijing, China)

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, May 2017, Pages 1509–1524
https://doi.org/10.1145/3035918.3035922

Published: 09 May 2017

ABSTRACT

OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and to know precisely what they are looking for. This paper makes a first attempt at automatically extracting top-k insights from multi-dimensional data. This not only helps non-expert users but also reduces the manual effort of data analysts. In particular, we propose the concept of an insight, which captures an interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of a measure by a dimension). An example insight is: "Brand B's rank (across brands) falls along the year, in terms of the increase in sales". Our problem is to compute the top-k insights according to a score function. This poses challenges in (i) the effectiveness of the result and (ii) the efficiency of computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real and synthetic data verifies the effectiveness and efficiency of our proposed solution.
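To make the example insight in the abstract concrete, the following is a minimal Python sketch of a two-step derivation followed by a top-k selection. It is an illustration under stated assumptions, not the paper's framework: the sales figures are made up, the function names (`yearly_increase`, `rank_across_brands`, `slope`, `top_k_insights`) are hypothetical, and the score used here (absolute least-squares slope of the rank series) is a simple stand-in for the scoring function the paper proposes. None of the paper's optimizations (pruning, ordering, specialized cube, computation sharing) are modeled.

```python
import heapq
from collections import defaultdict

# Made-up sales figures, SALES[brand][year]. Illustrative only, not from the paper.
SALES = {
    "A": {2012: 10, 2013: 14, 2014: 20, 2015: 30},
    "B": {2012: 10, 2013: 20, 2014: 25, 2015: 28},
    "C": {2012: 10, 2013: 12, 2014: 19, 2015: 23},
}


def yearly_increase(sales):
    """Step 1: per brand, the increase in sales over the previous year."""
    out = {}
    for brand, by_year in sales.items():
        years = sorted(by_year)
        out[brand] = {y: by_year[y] - by_year[p] for p, y in zip(years, years[1:])}
    return out


def rank_across_brands(values):
    """Step 2: per year, rank brands by value (1 = largest value)."""
    ranks = defaultdict(dict)
    years = sorted(next(iter(values.values())))
    for year in years:
        ordered = sorted(values, key=lambda b: values[b][year], reverse=True)
        for position, brand in enumerate(ordered, start=1):
            ranks[brand][year] = position
    return dict(ranks)


def slope(series):
    """Least-squares slope of a {year: value} series."""
    xs = sorted(series)
    ys = [series[x] for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def top_k_insights(sales, k):
    """Score each brand's rank series (assumed score: |slope|) and keep the k best."""
    ranks = rank_across_brands(yearly_increase(sales))
    candidates = [(abs(slope(series)), brand, series) for brand, series in ranks.items()]
    return heapq.nlargest(k, candidates, key=lambda c: c[0])


if __name__ == "__main__":
    for score, brand, series in top_k_insights(SALES, k=1):
        print(brand, series, round(score, 3))
```

With this toy data, brand B's rank in terms of yearly sales increase moves from 1st to 3rd, so it has the steepest rank trend and is the one returned for k = 1. A real system would enumerate many such candidates across dimensions and derivation steps, which is where the efficiency challenge described in the abstract arises.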

Index Terms

1. Information systems
  1. Data management systems
    1. Information integration
      1. Data warehouses
2. Mathematics of computing
  1. Probability and statistics
    1. Statistical paradigms
      1. Exploratory data analysis

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918
General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)
Program Chair: Dan Suciu (University of Washington, USA)
Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data exploration
insight extraction
olap
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%