HIFUN - a high level functional query language for big data analytics

Journal of Intelligent Information Systems

Abstract

We present a high level query language, called HIFUN, for defining analytic queries over big datasets, independently of how these queries are evaluated. An analytic query in HIFUN is defined to be a well-formed expression of a functional algebra that we define in the paper. The operations of this algebra combine functions to create HIFUN queries in much the same way as the operations of the relational algebra combine relations to create algebraic queries. The contributions of this paper are: (a) the definition of a formal framework in which to study analytic queries in the abstract; (b) the encoding of a HIFUN query either as a MapReduce job or as an SQL group-by query; and (c) the definition of a formal method for rewriting HIFUN queries and, as a case study, its application to the rewriting of MapReduce jobs and of SQL group-by queries. We emphasize that, although theoretical in nature, our work uses only basic and well known mathematical concepts, namely functions and their basic operations.
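To make contribution (b) more concrete, the following is a minimal Python sketch of how a query given by a grouping function, a measuring function and a reduction operation might be evaluated in a MapReduce style. The triple (g, m, op) mirrors the terminology used in the appendix; the function name `evaluate`, the sample `invoices` data and its field names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from functools import reduce
import operator


def evaluate(data, g, m, op):
    """Evaluate a query (g, m, op) over `data` in three MapReduce-like steps."""
    # Map: emit one (group key, measure) pair per data item.
    pairs = [(g(x), m(x)) for x in data]
    # Shuffle: collect the measures that share the same group key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: fold each group's measures with the binary operation op.
    return {key: reduce(op, values) for key, values in groups.items()}


# Hypothetical dataset: invoices with a branch and a quantity.
invoices = [
    {"id": 1, "branch": "Paris", "qty": 200},
    {"id": 2, "branch": "Paris", "qty": 100},
    {"id": 3, "branch": "Athens", "qty": 300},
]

totals = evaluate(
    invoices,
    g=lambda inv: inv["branch"],  # grouping function
    m=lambda inv: inv["qty"],     # measuring function
    op=operator.add,              # reduction operation
)
print(totals)
```

Under the same assumptions, this query corresponds to the SQL group-by query `SELECT branch, SUM(qty) FROM invoices GROUP BY branch`.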

Author information

Corresponding author

Correspondence to Nicolas Spyratos.

Additional information

Work conducted while the first author was visiting the FORTH Institute of Computer Science, Crete, Greece (https://www.ics.forth.gr/).

Appendix

Proof of Proposition 1

Let A and B be two nodes of \(\mathcal{C}\), and suppose there is a cycle on A consisting of two functions \(f : A \to B\) and \(g : B \to A\). Suppose that there is some element a in A such that \((g \circ f)(a) = a^{\prime}\) and \(a \neq a^{\prime}\). This implies that \(a^{\prime}\) depends on a, a contradiction to our assumption that the nodes of \(\mathcal{C}\) represent sets of independent values. Therefore the only possibility to have a cycle on A is when \((g \circ f)(a) = a\) for all a in A; in other words, the only possible cycle on A is \(\iota_{A}\), where \(\iota_{A}\) is the identity function on A. □

Proof of Proposition 2

Observe first that, as \(ans_{Q}\) is a function with source B, the query \(Q^{\prime\prime}\) is well formed; that is, its grouping function g and its measuring function \(ans_{Q}\) have the same source (namely B), and op is an operation on the target of \(ans_{Q}\).

Let \(c \in C\). It follows from well known properties of functions that:

if \(g^{-1}(c) = \{b_{1}, \ldots, b_{k}\}\) then \((g \circ f)^{-1}(c) = f^{-1}(b_{1}) \cup \ldots \cup f^{-1}(b_{k})\)

From our definition of answer, we have:

\(ans_{Q^{\prime }}(c) = red(m/(g \circ f)^{-1}(c), op)\)

As op is a distributive operation and the family \(\{f^{-1}(b_{1}), \ldots, f^{-1}(b_{k})\}\) is a partition of \((g \circ f)^{-1}(c)\), we have:

$$\begin{array}{@{}rcl@{}} ans_{Q^{\prime}}(c) &=& red(m/(g \circ f)^{-1}(c), op) \\ &=& op(red(m/f^{-1}(b_{1}), op), \ldots, red(m/f^{-1}(b_{k}), op)) \\ &=& op(ans_{Q}(b_{1}), \ldots, ans_{Q}(b_{k})) \quad \text{[because } red(m/f^{-1}(b_{i}), op) = ans_{Q}(b_{i}) \text{]} \\ &=& red(ans_{Q}/g^{-1}(c), op) \\ &=& ans_{Q^{\prime\prime}}(c) \end{array} $$

Therefore \(ans_{Q^{\prime}}(c) = ans_{Q^{\prime\prime}}(c)\) for all c in C, and this concludes the proof. □
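The equality established in this proof can be checked on a small example. The sketch below is only an illustration under assumed data: the names `answer`, `quantity`, `branch_of` and `region_of` are hypothetical, addition stands in for the distributive operation op, and the sets A, B and C are played by invoices, branches and regions.

```python
from collections import defaultdict
from functools import reduce
import operator


def answer(items, group, measure, op):
    """Reduce the measures of each block of the partition induced by `group`."""
    blocks = defaultdict(list)
    for x in items:
        blocks[group(x)].append(measure(x))
    return {c: reduce(op, values) for c, values in blocks.items()}


# Hypothetical data: A = invoices, B = branches, C = regions.
quantity = {"i1": 200, "i2": 100, "i3": 300, "i4": 50}        # m : A -> numbers
branch_of = {"i1": "b1", "i2": "b1", "i3": "b2", "i4": "b3"}  # f : A -> B
region_of = {"b1": "north", "b2": "north", "b3": "south"}     # g : B -> C

A = list(quantity)

# Q = (f, m, op): totals per branch.
ans_q = answer(A, branch_of.get, quantity.get, operator.add)

# Q' = (g o f, m, op): totals per region, computed from the raw data.
ans_q1 = answer(A, lambda a: region_of[branch_of[a]], quantity.get, operator.add)

# Q'' = (g, ans_Q, op): totals per region, computed from the answer of Q.
ans_q2 = answer(list(ans_q), region_of.get, ans_q.get, operator.add)

print(ans_q1 == ans_q2)
```

The second evaluation reuses the per-branch totals instead of rescanning the invoices, which is the practical benefit of the rewriting. An operation such as the arithmetic mean does not have this property in general, so the same shortcut would not apply to it directly.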

Proof of Proposition 3.1

Suppose first that there is a function \(h : A \to B\) such that \(g = h \circ f\). Then for all \(a, a^{\prime}\) in A such that \(f(a) = f(a^{\prime})\) we have: \(g(a) = h(f(a)) = h(f(a^{\prime})) = g(a^{\prime})\). Therefore \(f \leq g\).

Suppose next that \(f \leq g\). It follows that \(\pi_{f} \leq \pi_{g}\). Consider now any b in the range of f and define: \(h(b) = g(f^{-1}(b))\). As \(f \leq g\), the block \(f^{-1}(b)\) of \(\pi_{f}\) is included in some block of \(\pi_{g}\), say \(g^{-1}(c)\), where c is in the range of g. It follows that \(g(f^{-1}(b)) = c\), therefore h is a well defined function over the range of f. Moreover, from the definition of h we have, for any a in A with \(f(a) = b\): \(h(f(a)) = g(f^{-1}(b)) = g(a)\). Therefore \(h \circ f = g\).

Finally, suppose there is a function \(h^{\prime} : A \to B\) such that \(h^{\prime} \neq h\) and \(h^{\prime} \circ f = g = h \circ f\). Now, for any b in the range of f there is a in A such that \(f(a) = b\). Therefore \(h^{\prime}(b) = h^{\prime}(f(a)) = (h^{\prime} \circ f)(a) = g(a) = (h \circ f)(a) = h(b)\). It follows that \(h^{\prime}/range(f) = h/range(f)\), and this completes the proof. □
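The construction of h used in this proof can be sketched in Python. The sketch assumes finite data held in dictionaries; the function name `factor` and the sample mappings are hypothetical. It builds h on the range of f whenever f(a) = f(a′) implies g(a) = g(a′), and reports that no such h exists otherwise.

```python
def factor(domain, f, g):
    """Return h (a dict defined on range(f)) with g = h o f, or None if none exists."""
    h = {}
    for a in domain:
        b, c = f(a), g(a)
        # setdefault stores c the first time b is seen and returns the stored value after.
        if h.setdefault(b, c) != c:
            # Two items with the same f-value have different g-values, so no h exists.
            return None
    return h


# Hypothetical data: f maps invoices to branches, g maps invoices to regions.
branch_of = {"i1": "b1", "i2": "b1", "i3": "b2", "i4": "b3"}
region_of = {"i1": "north", "i2": "north", "i3": "north", "i4": "south"}
A = list(branch_of)

h = factor(A, branch_of.get, region_of.get)
print(h)

# The factorisation reproduces g on every element of A.
print(all(h[branch_of[a]] == region_of[a] for a in A))

# In the other direction the condition fails: i1 and i3 share a region but not a branch.
print(factor(A, region_of.get, branch_of.get))
```

Because each value b in the range of f receives exactly one image, the dictionary returned is the only possible h on that range, matching the uniqueness part of the argument.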

About this article

Cite this article

Spyratos, N., Sugibuchi, T. HIFUN - a high level functional query language for big data analytics. J Intell Inf Syst 51, 529–555 (2018). https://doi.org/10.1007/s10844-018-0495-6

  • DOI: https://doi.org/10.1007/s10844-018-0495-6
