Constrained recommendations for query visualizations

  • Regular Paper
Knowledge and Information Systems

Abstract

Improvements in data storage and acquisition techniques have led to huge accumulated data volumes in a variety of applications. International research enterprises such as the Human Genome and Digital Sky Survey projects are generating massive volumes of scientific data. A major challenge with these datasets is to glean insights from them, whether by discovering patterns or by uncovering relationships. Analyzing these massive, typically messy, and inconsistent volumes of data is both crucial and challenging in many application domains. Hence, the research community has introduced a number of visualization tools that guide and help analysts in exploring the data space to extract potentially useful information. However, when working with high-dimensional datasets, identifying visualizations that show interesting variations and trends in data is not trivial: the analyst must manually specify a large number of visualizations, explore relationships among various attributes, and examine different subsets of data before discovering visualizations that are interesting or insightful. Exploring all possible visualizations, however, is a costly and time-consuming process, especially when the dimensionality is high. Furthermore, databases are growing rapidly in both their channels and their dimensionality; thus, the transition from static analysis to real-time analytics represents a fundamental paradigm shift in the field of Big Data. Motivated by the above challenges, we propose an efficient framework called the real-time scoring engine (RtSEngine), which lets analysts limit the exploration to a specified number of visualizations and/or a certain execution time quota, and recommends a set of visualizations that meets the analysts’ budgets.
To achieve that, RtSEngine incorporates our proposed approaches to prioritize and score the attributes that form all possible visualizations in a dataset based on their statistical properties, such as selectivity, data distribution, and number of distinct values. RtSEngine then recommends the visualizations created from the top-scored attributes. Moreover, we present cost-aware techniques that estimate the retrieval and computation costs of each visualization so that analysts may discard high-cost visualizations. We evaluate the effectiveness and efficiency of our proposed approaches, and assess the quality of the recommended visualizations and the overhead incurred by applying our techniques, on both synthetic and real datasets.
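
The abstract does not give the scoring or cost formulas, so the following is only a minimal sketch of the general idea: rank attributes by a statistical property, then recommend views from the top-ranked attributes until a budget is exhausted. Everything in it is an assumption made for illustration and is not RtSEngine's actual method: the function names (`priority`, `estimated_cost`, `recommend`), the score (the reciprocal of the number of distinct values, ignoring selectivity and data distribution), the cost model (rows scanned plus one output group per distinct value), and the made-up toy rows.

```python
def distinct_count(rows, attr):
    """Number of distinct values of `attr` across all rows."""
    return len({row[attr] for row in rows})


def priority(rows, attr):
    """Hypothetical priority score: attributes with fewer groups rank higher.

    A constant attribute yields a single-bar chart, so it scores 0.
    """
    k = distinct_count(rows, attr)
    return 0.0 if k <= 1 else 1.0 / k


def estimated_cost(rows, attr):
    """Assumed cost model: one unit per row scanned plus one per output group."""
    return len(rows) + distinct_count(rows, attr)


def recommend(rows, dimensions, measures, max_views, max_cost):
    """Greedily pick (dimension, measure) views from the top-scored dimensions.

    Stops once `max_views` views are chosen; skips any view whose estimated
    cost would push the running total above `max_cost`.
    """
    ranked = sorted(dimensions, key=lambda a: priority(rows, a), reverse=True)
    chosen, spent = [], 0
    for dim in ranked:
        for measure in measures:
            if len(chosen) == max_views:
                return chosen
            cost = estimated_cost(rows, dim)
            if spent + cost > max_cost:
                continue  # too expensive for the remaining budget
            chosen.append((dim, measure))
            spent += cost
    return chosen


# Made-up toy rows, for illustration only.
rows = [
    {"carrier": "AA", "origin": "JFK", "flight_id": 1, "delay": 10},
    {"carrier": "AA", "origin": "LAX", "flight_id": 2, "delay": 5},
    {"carrier": "UA", "origin": "SFO", "flight_id": 3, "delay": 0},
    {"carrier": "UA", "origin": "JFK", "flight_id": 4, "delay": 20},
]

if __name__ == "__main__":
    dims = ["flight_id", "origin", "carrier"]
    print(recommend(rows, dims, ["delay"], max_views=2, max_cost=15))
```

In this sketch the two budgets act independently: `max_views` caps how many visualizations are returned, while `max_cost` stands in for an execution time quota by discarding views whose estimated cost does not fit. A real system would replace both the score and the cost estimate with statistics taken from the database catalog or from samples.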

Notes

  1. Implementations and data are available at: https://github.com/ibrahimDKE/Cdb_RtsEngine_DKE_UQ.

  2. http://www.transtats.bts.gov/.

Author information

Corresponding author

Correspondence to Ibrahim A. Ibrahim.

About this article

Cite this article

Ibrahim, I.A., Albarrak, A.M. & Li, X. Constrained recommendations for query visualizations. Knowl Inf Syst 51, 499–529 (2017). https://doi.org/10.1007/s10115-016-1001-5

  • DOI: https://doi.org/10.1007/s10115-016-1001-5
