Abstract
Advances in data storage and data acquisition techniques have led to huge accumulated data volumes in a variety of applications. International research enterprises such as the Human Genome and the Digital Sky Survey projects are generating massive volumes of scientific data. A major challenge with these datasets is to glean insights from them by discovering patterns and identifying relationships. Analyzing these massive, typically messy, and inconsistent volumes of data is both crucial and challenging in many application domains. Hence, the research community has introduced a number of visualization tools to guide and help analysts in exploring the data space to extract potentially useful information. However, when working with high-dimensional datasets, identifying visualizations that show interesting variations and trends in data is not trivial: the analyst must manually specify a large number of visualizations, explore relationships among various attributes, and examine different subsets of data before discovering visualizations that are interesting or insightful. Exploring all possible visualizations, however, is a costly and time-consuming process, especially when the dimensionality is high. Furthermore, databases are growing rapidly and becoming multifaceted in their channels and dimensionality; thus, the transition from static analysis to real-time analytics represents a fundamental paradigm shift in the field of Big Data. Motivated by the above challenges, we propose an efficient framework called real-time scoring engine (RtSEngine) that helps analysts limit the exploration to a specified number of visualizations and/or a certain execution time quota, and recommends a set of visualizations that meets the analysts’ budgets.
To achieve that, RtSEngine incorporates our proposed approaches to prioritize and score the attributes that form all possible visualizations in a dataset based on their statistical properties, such as selectivity, data distribution, and number of distinct values. RtSEngine then recommends the visualizations created from the top-scored attributes. Moreover, we present cost-aware techniques that estimate the retrieval and computation costs of each visualization so that analysts may discard high-cost visualizations. We evaluate the effectiveness and efficiency of our proposed approaches, and assess the quality of the visualizations and the overhead incurred by applying our techniques on both synthetic and real datasets.
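The idea of scoring attributes by their statistical properties and recommending only as many visualizations as a budget allows can be illustrated with a minimal sketch. It is not the RtSEngine implementation: the function names `score_attribute` and `recommend`, the `max_distinct` cutoff, and the choice of normalized entropy as the score are all assumptions made here for illustration, standing in for the scoring approaches the paper proposes.

```python
from collections import Counter
from itertools import islice, product
from math import log2


def score_attribute(rows, attr, max_distinct=10):
    """Hypothetical score in [0, 1] for a candidate group-by attribute.

    Assumption: an attribute is more useful when it has a moderate number
    of distinct values and its values are spread evenly, measured here by
    entropy normalized by log2(number of distinct values).
    """
    counts = Counter(row[attr] for row in rows)
    distinct = len(counts)
    # Constant attributes and near-unique attributes make poor bar charts.
    if distinct < 2 or distinct > max_distinct:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropy / log2(distinct)


def recommend(rows, dimensions, measures, budget):
    """Return at most `budget` (dimension, measure) pairs.

    Dimensions are ranked by score, so pairs built from the top-scored
    dimension come first; the budget caps how many pairs are returned.
    """
    ranked = sorted(
        dimensions,
        key=lambda attr: score_attribute(rows, attr),
        reverse=True,
    )
    return list(islice(product(ranked, measures), budget))
```

For a table whose rows are dictionaries, `recommend(rows, ["status", "region"], ["sales"], budget=1)` would return the single pair built from whichever dimension scores highest. A time budget could be handled in the same spirit by stopping once an estimated cost total is exceeded, but that cost model is not sketched here.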
Notes
Implementations and data are available at: https://github.com/ibrahimDKE/Cdb_RtsEngine_DKE_UQ.
Cite this article
Ibrahim, I.A., Albarrak, A.M. & Li, X. Constrained recommendations for query visualizations. Knowl Inf Syst 51, 499–529 (2017). https://doi.org/10.1007/s10115-016-1001-5
DOI: https://doi.org/10.1007/s10115-016-1001-5