A Metadata Diagnostic Framework for a New Approximate Query Engine Working with Granulated Data Summaries

Chądzyńska-Krasowska, Agnieszka; Stawicki, Sebastian; Ślęzak, Dominik

doi:10.1007/978-3-319-60837-2_50

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10313)

Included in the following conference series:

International Joint Conference on Rough Sets

Abstract

This paper refers to a new database engine that acquires and utilizes granulated data summaries for fast approximate execution of analytical SQL statements. We focus on the creation of a relational metadata repository that enables the engine's developers and users to investigate the collected data summaries independently of the engine itself. We discuss how the design of this repository evolved over time from both conceptual and software engineering perspectives, addressing the challenges of converting and accessing internal engine contents that can represent hundreds of terabytes of original data. We show some usage scenarios of the resulting metadata repository for both diagnostic and analytical purposes. We pay particular attention to the relationship between these scenarios and the principles of rough sets, one of the theories that strongly influenced the presented solutions. We also report some empirical results obtained for relatively small fragments (\(100 \times 2^{16}\) rows each) of data sets coming from two organizations that use the new engine.

Notes

1. One of the current deployments of the new engine works with 30-day periods, with over 10 billion new data rows arriving every day and ad-hoc analytical queries required to execute in 2 s.
2. Formerly known as Brighthouse and Infobright Community/Enterprise Edition.
3. https://pypi.python.org/pypi/matplotlib.
4. https://pypi.python.org/pypi/lxml.
5. https://pypi.python.org/pypi/pandas.

Author information

Authors and Affiliations

Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008, Warsaw, Poland
Agnieszka Chądzyńska-Krasowska
Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland
Sebastian Stawicki & Dominik Ślęzak

Corresponding author

Correspondence to Dominik Ślęzak.

Editor information

Editors and Affiliations

Polish-Japanese Academy of Information Technology, Warsaw, Poland
Lech Polkowski
University of Regina, Regina, SK, Canada
Yiyu Yao
University of Warmia and Mazury, Olsztyn, Poland
Piotr Artiemjew
University of Milano-Bicocca, Milano, Italy
Davide Ciucci
Southwest Jiaotong University, Chengdu, China
Dun Liu
Warsaw University, Warszawa, Poland
Dominik Ślęzak
Silesian University, Sosnowiec, Poland
Beata Zielosko

About this paper

Cite this paper

Chądzyńska-Krasowska, A., Stawicki, S., Ślęzak, D. (2017). A Metadata Diagnostic Framework for a New Approximate Query Engine Working with Granulated Data Summaries. In: Polkowski, L., et al. Rough Sets. IJCRS 2017. Lecture Notes in Computer Science, vol 10313. Springer, Cham. https://doi.org/10.1007/978-3-319-60837-2_50

DOI: https://doi.org/10.1007/978-3-319-60837-2_50
Published: 22 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60836-5
Online ISBN: 978-3-319-60837-2
eBook Packages: Computer Science, Computer Science (R0)
