HIFUN - a high level functional query language for big data analytics

Spyratos, Nicolas; Sugibuchi, Tsuyoshi

doi:10.1007/s10844-018-0495-6

Published: 06 February 2018

Volume 51, pages 529–555, (2018)

Journal of Intelligent Information Systems

Abstract

We present a high level query language, called HIFUN, for defining analytic queries over big datasets, independently of how these queries are evaluated. An analytic query in HIFUN is defined to be a well-formed expression of a functional algebra that we define in the paper. The operations of this algebra combine functions to create HIFUN queries in much the same way as the operations of the relational algebra combine relations to create algebraic queries. The contributions of this paper are: (a) the definition of a formal framework in which to study analytic queries in the abstract; (b) the encoding of a HIFUN query either as a MapReduce job or as an SQL group-by query; and (c) the definition of a formal method for rewriting HIFUN queries and, as a case study, its application to the rewriting of MapReduce jobs and of SQL group-by queries. We emphasize that, although theoretical in nature, our work uses only basic and well known mathematical concepts, namely functions and their basic operations.
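As a rough illustration of the kind of query described above, the sketch below assumes a query given as a triple (g, m, op), following the notation of the appendix: a grouping function g, a measuring function m, and a reduction operation op. It evaluates the query in group-by style. The helper name `answer`, the invoice tuples, and their field layout are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from functools import reduce
import operator


def answer(items, g, m, op):
    """Evaluate a query (g, m, op) over items.

    Items are grouped by g, each item is measured by m, and the
    measures of each group are reduced with the binary operation op.
    """
    groups = defaultdict(list)
    for x in items:
        groups[g(x)].append(m(x))
    return {key: reduce(op, values) for key, values in groups.items()}


# Hypothetical delivery invoices: (invoice id, branch, quantity).
invoices = [
    (1, "b1", 200),
    (2, "b1", 100),
    (3, "b2", 200),
    (4, "b2", 400),
    (5, "b3", 100),
]

# "Total quantity by branch": group by branch, measure quantity, reduce by sum.
totals = answer(invoices, g=lambda d: d[1], m=lambda d: d[2], op=operator.add)
print(totals)  # {'b1': 300, 'b2': 600, 'b3': 100}
```

The same triple could equally be read as a map step (emit the pair g(x), m(x)) followed by a reduce step (apply op per key), or as an SQL group-by query, which is the sense in which the definition is independent of the evaluation mechanism.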

Author information

Authors and Affiliations

Laboratoire de Recherche en Informatique, Université Paris-Sud 11, 15 Rue Georges Clemenceau, 91400, Orsay, France
Nicolas Spyratos
OppScience, 14 Avenue Trudaine, 75009, Paris, France
Tsuyoshi Sugibuchi

Corresponding author

Correspondence to Nicolas Spyratos.

Additional information

Work conducted while the first author was visiting the FORTH Institute of Computer Science, Crete, Greece (https://www.ics.forth.gr/).

Appendix

Proof of Proposition 1

Let A and B be two nodes of $\mathcal{C}$, and suppose there is a cycle on A consisting of two functions: $f : A \to B$ and $g : B \to A$. Suppose that there is some element a in A such that $(g \circ f)(a) = a^{\prime}$ and $a \neq a^{\prime}$. This implies that $a^{\prime}$ depends on a, a contradiction to our assumption that the nodes of $\mathcal{C}$ represent sets of independent values. Therefore the only possibility to have a cycle on A is when $(g \circ f)(a) = a$ for all a in A; in other words, the only possible cycle on A is $\iota_A$, where $\iota_A$ is the identity function on A. □

Proof of Proposition 2

Observe first that, as $ans_{Q}$ is a function with source B, the query $Q^{\prime\prime}$ is well formed, that is, its grouping function g and its measuring function $ans_{Q}$ have the same source (namely B), and op is an operation on the target of $ans_{Q}$.

Let $c \in C$. It follows from well known properties of functions that:

if $g^{-1}(c) = \{b_{1}, \ldots, b_{k}\}$ then $(g \circ f)^{-1}(c) = f^{-1}(b_{1}) \cup \ldots \cup f^{-1}(b_{k})$

From our definition of answer, we have:

$ans_{Q^{\prime}}(c) = red(m/(g \circ f)^{-1}(c), op)$

As op is a distributive operation and the family $\{f^{-1}(b_{1}), \ldots, f^{-1}(b_{k})\}$ is a partition of $(g \circ f)^{-1}(c)$, we have:

$$\begin{array}{@{}rcl@{}}
ans_{Q^{\prime}}(c) &=& red(m/(g \circ f)^{-1}(c), op) \\
&=& op(red(m/f^{-1}(b_{1}), op), \ldots, red(m/f^{-1}(b_{k}), op)) \\
&=& op(ans_{Q}(b_{1}), \ldots, ans_{Q}(b_{k})) \quad [\text{because } red(m/f^{-1}(b_{i}), op) = ans_{Q}(b_{i})] \\
&=& red(ans_{Q}/g^{-1}(c), op) \\
&=& ans_{Q^{\prime\prime}}(c)
\end{array}$$

Therefore $ans_{Q^{\prime}}(c) = ans_{Q^{\prime\prime}}(c)$ for all c in C and this concludes the proof. □
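The equality established in this proof can be checked numerically on a small example. The sketch below assumes, as the proof's notation suggests, that Q groups by f and measures by m, that Q′ groups by g ∘ f and measures by m, and that Q″ groups by g and measures by the answer of Q. The concrete sets, the dictionaries standing in for f, g and m, and the helper `answer` are hypothetical; op is taken to be addition, which is distributive in the required sense.

```python
from collections import defaultdict
from functools import reduce
import operator


def answer(domain, grouping, measure, op):
    """Group domain by `grouping`, then reduce the measures of each group with op."""
    groups = defaultdict(list)
    for x in domain:
        groups[grouping(x)].append(measure(x))
    return {key: reduce(op, values) for key, values in groups.items()}


# Hypothetical data: f : A -> B, g : B -> C, m : A -> integers.
A = range(1, 9)
f = {1: "b1", 2: "b1", 3: "b2", 4: "b2", 5: "b3", 6: "b3", 7: "b4", 8: "b4"}
g = {"b1": "c1", "b2": "c1", "b3": "c2", "b4": "c2"}
m = {a: 10 * a for a in A}
op = operator.add

# Q = (f, m, op): answer defined on B.
ans_Q = answer(A, f.get, m.get, op)

# Q' = (g o f, m, op): evaluated directly over A.
ans_Q_prime = answer(A, lambda a: g[f[a]], m.get, op)

# Q'' = (g, ans_Q, op): evaluated over B, reusing the answer of Q.
ans_Q_second = answer(ans_Q.keys(), g.get, ans_Q.get, op)

print(ans_Q_prime)   # {'c1': 100, 'c2': 260}
print(ans_Q_second)  # {'c1': 100, 'c2': 260}
```

In this example the second evaluation touches only the four partial results in `ans_Q` rather than the eight elements of A, which is the practical benefit of the rewriting.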

Proof of Proposition 3.1

Suppose first that there is a function h, defined on the target of f, such that $g = h \circ f$. Then for all $a, a^{\prime}$ in A such that $f(a) = f(a^{\prime})$ we have: $g(a) = h(f(a)) = h(f(a^{\prime})) = g(a^{\prime})$. Therefore $f \leq g$.

Suppose next that $f \leq g$. It follows that $\pi_f \leq \pi_g$. Consider now any b in the range of f and define: $h(b) = g(f^{-1}(b))$. As $f \leq g$, the block $f^{-1}(b)$ of $\pi_f$ is included in some block of $\pi_g$, say $g^{-1}(c)$, where c is in the range of g. It follows that: $g(f^{-1}(b)) = c$, therefore h is a well defined function over the range of f. Moreover, for any a in A with $f(a) = b$, the definition of h gives: $h(f(a)) = h(b) = g(f^{-1}(b)) = g(a)$. Therefore $h \circ f = g$.

Finally, suppose there is a function $h^{\prime}$, with the same source and target as h, such that $h^{\prime} \neq h$ and $h^{\prime} \circ f = g = h \circ f$. Now, for any b in the range of f there is a in A such that $f(a) = b$. Therefore $h^{\prime}(b) = h^{\prime}(f(a)) = (h^{\prime} \circ f)(a) = g(a) = (h \circ f)(a) = h(b)$. It follows that: $h^{\prime}/range(f) = h/range(f)$ and this completes the proof. □
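The construction of h in the second part of the proof can be sketched on finite data. The code below is an illustration under assumptions, not the paper's procedure: the helper name `factor_through`, the date strings, and the `month` and `year` functions are hypothetical. It builds h on the range of f by setting h(f(a)) = g(a), and reports failure as soon as two elements with the same f-value have different g-values, that is, when f ≤ g does not hold.

```python
def factor_through(domain, f, g):
    """Return h (as a dict on range(f)) with h[f(a)] == g(a) for all a in domain.

    Returns None when no such h exists, i.e. when some block of the
    partition induced by f is not contained in a single block of the
    partition induced by g.
    """
    h = {}
    for a in domain:
        b, c = f(a), g(a)
        # setdefault stores c on first sight of b, else returns the stored value.
        if h.setdefault(b, c) != c:
            return None
    return h


# Hypothetical example: dates grouped by month (f) and by year (g).
dates = ["2018-01-05", "2018-01-20", "2018-02-03"]


def month(d):
    return d[:7]


def year(d):
    return d[:4]


# month <= year, so year factors through month.
h = factor_through(dates, month, year)
print(h)  # {'2018-01': '2018', '2018-02': '2018'}

# year <= month fails: one year contains two different months.
print(factor_through(dates, year, month))  # None
```

Because h is only ever assigned on values of the form f(a), the sketch also reflects the uniqueness claim: any two functions that factor g through f must agree on the range of f.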

About this article

Cite this article

Spyratos, N., Sugibuchi, T. HIFUN - a high level functional query language for big data analytics. J Intell Inf Syst 51, 529–555 (2018). https://doi.org/10.1007/s10844-018-0495-6

Received: 25 April 2017
Revised: 03 January 2018
Accepted: 08 January 2018
Published: 06 February 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s10844-018-0495-6
