Resource-aware adaptive indexing for in situ visual exploration and analytics

  • Regular Paper
  • The VLDB Journal

Abstract

In in situ data management scenarios, large data files that do not fit in main memory must be handled efficiently on commodity hardware, without the overhead of a preprocessing phase or of loading the data into a database. In this work, we study the challenges posed by visual analysis tasks in in situ scenarios in the presence of memory constraints. We present an indexing scheme and adaptive query evaluation techniques, which enable efficient categorical-based group-by and filter operations, combined with 2D visual interactions, such as exploration of data points on maps or scatter plots. The indexing scheme combines a tile-based structure, which offers efficient visual exploration over the 2D plane, with a tree-based structure, which organizes a tile’s objects based on their categorical values. The index is constructed on-the-fly, resides in main memory, and is built progressively as the user explores parts of the raw file, whereas its structure and level of granularity are adjusted to the user’s exploration areas and type of analysis. To handle the cases where limited resources are available, we introduce a resource-aware index initialization mechanism, formulate it as an NP-hard optimization problem, and propose two efficient approximation algorithms to solve it. We conduct extensive experiments using real and synthetic datasets and demonstrate that our approach reports interactive query response times (less than 0.04 sec) and in most cases is more than 100× faster and performs up to two orders of magnitude fewer I/O operations compared to existing solutions. The proposed methods are implemented as part of an open-source system for in situ visual exploration and analytics.
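
The two-layer idea described above (tiles over the 2D plane, with each tile's objects organized by categorical value) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names TileIndex, Leaf, and group_by_mean are hypothetical, the grid is fixed rather than adaptive, and the per-tile tree is flattened to a single categorical attribute. Leaves keep running metadata (count and sum), so a group-by over fully covered tiles needs no file access; partially covered tiles are only reported, since resolving them would require reading the raw file.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Leaf:
    """Leaf of a tile's categorical tree (hypothetical layout)."""
    offsets: list = field(default_factory=list)  # object positions in the raw file
    count: int = 0
    total: float = 0.0  # running sum of one numeric attribute


class TileIndex:
    """Uniform grid over the 2D plane; each tile groups its objects by category."""

    def __init__(self, x_min, x_max, y_min, y_max, cells_per_axis):
        self.x_min, self.y_min = x_min, y_min
        self.n = cells_per_axis
        self.dx = (x_max - x_min) / cells_per_axis
        self.dy = (y_max - y_min) / cells_per_axis
        # (col, row) -> category value -> Leaf
        self.tiles = defaultdict(lambda: defaultdict(Leaf))

    def _tile_of(self, x, y):
        # Clamp so that points on the upper boundary fall in the last cell.
        col = min(int((x - self.x_min) / self.dx), self.n - 1)
        row = min(int((y - self.y_min) / self.dy), self.n - 1)
        return col, row

    def insert(self, x, y, category, value, offset):
        leaf = self.tiles[self._tile_of(x, y)][category]
        leaf.offsets.append(offset)
        leaf.count += 1
        leaf.total += value

    def group_by_mean(self, x_lo, x_hi, y_lo, y_hi):
        """Per-category mean over tiles fully inside the window.

        Returns (means, partial), where partial lists the tiles that only
        partly overlap the window and would need raw-file access.
        """
        acc = defaultdict(lambda: [0, 0.0])  # category -> [count, sum]
        partial = []
        for (col, row), groups in self.tiles.items():
            tx_lo = self.x_min + col * self.dx
            ty_lo = self.y_min + row * self.dy
            tx_hi, ty_hi = tx_lo + self.dx, ty_lo + self.dy
            if tx_hi <= x_lo or tx_lo >= x_hi or ty_hi <= y_lo or ty_lo >= y_hi:
                continue  # tile is disjoint from the window
            if x_lo <= tx_lo and tx_hi <= x_hi and y_lo <= ty_lo and ty_hi <= y_hi:
                for category, leaf in groups.items():
                    acc[category][0] += leaf.count
                    acc[category][1] += leaf.total
            else:
                partial.append((col, row))
        means = {cat: s / c for cat, (c, s) in acc.items()}
        return means, partial
```

In an adaptive variant, a tile that repeatedly shows up in the partial list would be a natural candidate for splitting, which is the kind of exploration-driven refinement the abstract describes.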

Notes

  1. For example, https://www.tutela.com

  2. The source code is available at: http://github.com/VisualFacts/RawVis

  3. We assume that the users are familiar with the schema, the min/max values, and the domains of the attributes in the data file; otherwise, they can preview the file by loading a small sample or parsing it once.

  4. More than 90% and 75% of the statistics supported by SciPy and Wolfram, respectively, are defined as algebraic aggregate functions [46].

  5. Note that, since several details are omitted, the order of the steps may be different compared to the following paragraphs, where the process is presented in detail. Also, in the implementation, several of these steps are performed in parallel.

  6. Recall that the memory for each node is (almost) the same, with the exception of the leaf nodes where metadata are stored. For simplicity, we assume that all nodes have equal memory size.

  7. The data generator and the queries are available at: github.com/VisualFacts/RawVis

  8. www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

  9. https://github.com/HBPMedical/PostgresRAW

  10. http://dev.mysql.com/doc/refman/8.0/en/csv-storage-engine.html

  11. http://oracle-base.com/articles/12c/external-table-enhancements-12cr1

  12. www.postgresql.org/docs/current/ddl-foreign-data.html
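
Note 4 refers to algebraic aggregate functions, that is, statistics that can be derived from a fixed-size set of partial aggregates which themselves combine across partitions. A minimal Python sketch, under the assumption that each index node stores count, sum, and sum of squares (the class name Moments and its methods are hypothetical, not taken from the paper):

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Fixed-size partial aggregate from which mean and variance follow."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def merge(self, other):
        # Partial aggregates of disjoint partitions combine by addition.
        return Moments(self.count + other.count,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    def mean(self):
        return self.total / self.count

    def variance(self):
        # Population variance, E[x^2] - E[x]^2. This form is simple but can
        # lose precision for large values; it is used here for illustration only.
        m = self.mean()
        return self.total_sq / self.count - m * m


left, right = Moments(), Moments()
for v in (1.0, 2.0):
    left.add(v)
for v in (3.0, 4.0):
    right.add(v)
both = left.merge(right)
print(both.mean(), both.variance())
```

Because merge only adds three numbers, the statistics of a query window can be assembled from per-node metadata without revisiting the underlying objects.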

References

  1. Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., Stoica, I.: BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In: European Conference on Computer Systems (EuroSys) (2013)

  2. Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: Efficient Query Execution on Raw Data Files. In: ACM Conf on Management of Data (SIGMOD) (2012)

  3. Battle, L., Chang, R., Stonebraker, M.: Dynamic Prefetching of Data Tiles for Interactive Visualization. In: ACM Conf on Management of Data (SIGMOD) (2016)

  4. Bikakis, N., Liagouris, J., Krommyda, M., Papastefanatos, G., Sellis, T.: Towards Scalable Visual Exploration of Very Large RDF Graphs. In: Extended Semantic Web Conference (ESWC) (2015)

  5. Bikakis, N., Liagouris, J., Krommyda, M., Papastefanatos, G., Sellis, T.: Graphvizdb: A Scalable Platform for Interactive Large Graph Visualization. In: IEEE ICDE (2016)

  6. Bikakis, N., Maroulis, S., Papastefanatos, G., Vassiliadis, P.: RawVis: Visual Exploration over Raw Data. In: Advances in Databases and Information Systems (ADBIS) (2018)

  7. Bikakis, N., Maroulis, S., Papastefanatos, G., Vassiliadis, P.: In-situ Visual Exploration over Big Raw Data. Inform. Sys. 40 (2021)

  8. Bikakis, N., Papastefanatos, G., Skourla, M., Sellis, T.: A Hierarchical Aggregation Framework for Efficient Multilevel Visual Exploration and Analysis. Semantic Web Journal (2017)

  9. Blanas, S., Wu, K., Byna, S., Dong, B., Shoshani, A.: Parallel Data Analysis Directly on Scientific File Formats. In: ACM Conf on Management of Data (SIGMOD) (2014)

  10. Cheng, Y., Rusu, F.: SCANRAW: a Database Meta-operator for Parallel In-situ Processing and Loading. ACM TODS 40(3) (2015)

  11. Dar, S., Franklin, M.J., Thór Jónsson, B., Srivastava, D., Tan, M.: Semantic Data Caching and Replacement. In: VLDB (1996)

  12. El-Hindi, M., Zhao, Z., Binnig, C., Kraska, T.: Vistrees: Fast Indexes for Interactive Data Exploration. In: HILDA (2016)

  13. Fekete, J., Fisher, D., Nandi, A., Sedlmair, M.: Progressive Data Analysis and Visualization (Dagstuhl Seminar 18411). Dagstuhl Reports 8(10) (2018)

  14. Fisher, D., Popov, I.O., Drucker, S.M., Schraefel, M.C.: Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster. In: CHI (2012)

  15. Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., Pirahesh, H.: Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals. Data Min. Knowl. Discov. 1(1) (1997)

  16. Holanda, P., Manegold, S.: Progressive mergesort: Merging batches of appends into progressive indexes. In: Conf on Extending Database Technology (EDBT) (2021)

  17. Holanda, P., Manegold, S., Mühleisen, H., Raasveldt, M.: Progressive Indexes: Indexing for Interactive Data Analysis. PVLDB 12(13) (2019)

  18. Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here Are My Data Files. Here Are My Queries. Where Are My Results? In: Conf on Innovative Data Systems Research (CIDR) (2011)

  19. Idreos, S., Kersten, M.L., Manegold, S.: Database Cracking. In: Conf on Innovative Data Systems Research (CIDR) (2007)

  20. Ivanova, M., Kersten, M.L., Manegold, S., Kargin, Y.: Data vaults: database technology for scientific file repositories. Comput Sci Eng 15(3) (2013)

  21. Jensen, A.H., Lauridsen, F., Zardbani, F., Idreos, S., Karras, P.: Revisiting multidimensional adaptive indexing [experiment & analysis]. In: Conf on Extending Database Technology (EDBT) (2021)

  22. Jugel, U., Jerzak, Z., Hackenbroich, G., Markl, V.: VDDa: Automatic visualization-driven data aggregation in relational databases. J Very Large Data Bases (VLDBJ) (2015)

  23. Kalinin, A., Çetintemel, U., Zdonik, S.B.: Interactive data exploration using semantic windows. In: ACM SIGMOD (2014)

  24. Karpathiotakis, M., Alagiannis, I., Ailamaki, A.: Fast queries over heterogeneous data through engine customization. PVLDB 9(12) (2016)

  25. Karpathiotakis, M., Branco, M., Alagiannis, I., Ailamaki, A.: Adaptive query processing on raw data. PVLDB 7(12) (2014)

  26. de Lara Pahins, C.A., Stephens, S.A., Scheidegger, C., Comba, J.L.D.: Hashedcubes: Simple, Low Memory, Real-time Visual Exploration of Big Data. IEEE Trans Visualiz Comp Graph 23(1) (2017)

  27. Lins, L.D., Klosowski, J.T., Scheidegger, C.E.: Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Trans Visualiz Comp Graph 19, 2456–2465 (2013)

  28. Liu, C., Wu, C., Shao, H., Yuan, X.: Smartcube: An adaptive data management architecture for the real-time visualization of spatiotemporal datasets. IEEE TVCG 26(1) (2020)

  29. Maroulis, S., Bikakis, N., Papastefanatos, G., Vassiliadis, P.: RawVis: A System for Efficient In-situ Visual Analytics. In: ACM Conf on Management of Data (SIGMOD) (2021)

  30. Maroulis, S., Bikakis, N., Papastefanatos, G., Vassiliadis, P., Vassiliou, Y.: Adaptive indexing for in-situ visual exploration and analytics. In: DOLAP Workshop (2021)

  31. Miranda, F., Lins, L., Klosowski, J.T., Silva, C.T.: TopKube: A Rank-Aware Data Cube for Real-Time Exploration of Spatiotemporal Data. IEEE TVCG 24 (2017)

  32. Morton, K., Balazinska, M., Grossman, D., Mackinlay, J.D.: Support the Data Enthusiast: Challenges for Next-generation Data-analysis Systems. PVLDB 7(6) (2014)

  33. Nathan, V., Ding, J., Alizadeh, M., Kraska, T.: Learning multi-dimensional indexes. In: ACM SIGMOD (2020)

  34. Nerone, M., Holanda, P., de Almeida, E.C., Manegold, S.: Multidimensional Adaptive and Progressive Indexes. In: IEEE Conf on Data Engineering (ICDE) (2021)

  35. Olma, M., Karpathiotakis, M., Alagiannis, I., Athanassoulis, M., Ailamaki, A.: Slalom: Coasting through Raw Data Via Adaptive Partitioning and Indexing. PVLDB 10(10) (2017)

  36. Olma, M., Karpathiotakis, M., Alagiannis, I., Athanassoulis, M., Ailamaki, A.: Adaptive partitioning and indexing for in situ query processing. J Very Large Data Bases (VLDBJ) (2019)

  37. Papastefanatos, G., Alexiou, G., Bikakis, N., Maroulis, S., Stamatopoulos, V.: Visualfacts: A platform for in-situ visual exploration and real-time entity resolution. In: Workshop on Big Data Visual Exploration & Analytics (BigVis) (2022)

  38. Pavlovic, M., Sidlauskas, D., Heinis, T., Ailamaki, A.: QUASII: query-aware spatial incremental index. In: Conf on Extending Database Technology (EDBT) (2018)

  39. Rahman, P., Jiang, L., Nandi, A.: Evaluating interactive data systems. J Very Large Data Bases (VLDBJ) 29(1) (2020)

  40. Rahman, S., Aliakbarpour, M., Kong, H., Blais, E., Karahalios, K., Parameswaran, A.G., Rubinfeld, R.: I’ve Seen “Enough”: incrementally improving visualizations to support rapid decision making. PVLDB 10(11) (2017)

  41. Richter, S., Quiané-Ruiz, J., Schuh, S., Dittrich, J.: Towards zero-overhead static and adaptive indexing in Hadoop. J Very Large Data Bases (VLDBJ) 23(3) (2014)

  42. Tao, W., Liu, X., Wang, Y., Battle, L., Demiralp, Ç., Chang, R., Stonebraker, M.: Kyrix: Interactive pan/zoom visualizations at scale. Comput. Graph. Forum 38(3) (2019)

  43. Tauheed, F., Heinis, T., Schürmann, F., Markram, H., Ailamaki, A.: SCOUT: Prefetching for Latent Feature Following Queries. PVLDB 5(11) (2012)

  44. Tian, Y., Alagiannis, I., Liarou, E., Ailamaki, A., Michiardi, P., Vukolic, M.: Dinodb: An Interactive-speed Query Engine for Ad-hoc Queries on Temporary Data. IEEE Trans Big Data (2017)

  45. Wang, Z., Ferreira, N., Wei, Y., Bhaskar, A.S., Scheidegger, C.: Gaussian cubes: Real-time modeling for visual exploration of large multidimensional datasets. IEEE Trans Visualiz Comp Graph 23(1) (2017)

  46. Wasay, A., Wei, X., Dayan, N., Idreos, S.: Data canopy: accelerating exploratory statistical analysis. In: ACM SIGMOD (2017)

  47. Yesilmurat, S., Isler, V.: Retrospective adaptive prefetching for interactive Web GIS applications. GeoInformatica 16(3) (2012)

  48. Zardbani, F., Afshani, P., Karras, P.: Revisiting the theory and practice of database cracking. In: EDBT (2020)

  49. Zhao, W., Rusu, F., Dong, B., Wu, K., Ho, A.Y.Q., Nugent, P.: Distributed caching for processing raw arrays. In: Conf on Scientific & Statistical Database Management (SSDBM) (2018)

Acknowledgements

This work was funded by the project VisualFacts (#1614 - 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of postdoctoral researchers).

Author information

Corresponding author

Correspondence to Stavros Maroulis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Maroulis, S., Bikakis, N., Papastefanatos, G. et al. Resource-aware adaptive indexing for in situ visual exploration and analytics. The VLDB Journal 32, 199–227 (2023). https://doi.org/10.1007/s00778-022-00739-z

  • DOI: https://doi.org/10.1007/s00778-022-00739-z