Abstract
We present a high-level query language, called HIFUN, for defining analytic queries over big datasets, independently of how these queries are evaluated. An analytic query in HIFUN is defined to be a well-formed expression of a functional algebra that we define in the paper. The operations of this algebra combine functions to create HIFUN queries in much the same way as the operations of the relational algebra combine relations to create algebraic queries. The contributions of this paper are: (a) the definition of a formal framework in which to study analytic queries in the abstract; (b) the encoding of a HIFUN query either as a MapReduce job or as an SQL group-by query; and (c) the definition of a formal method for rewriting HIFUN queries and, as a case study, its application to the rewriting of MapReduce jobs and of SQL group-by queries. We emphasize that, although theoretical in nature, our work uses only basic and well-known mathematical concepts, namely functions and their basic operations.
Additional information
Work conducted while the first author was visiting the FORTH Institute of Computer Science, Crete, Greece (https://www.ics.forth.gr/).
Appendix
Proof of Proposition 1
Let A and B be two nodes of \(\mathcal{C}\), and suppose there is a cycle on A consisting of two functions: \(f : A \to B\) and \(g : B \to A\). Suppose that there is some element a in A such that \((g \circ f)(a) = a'\) and \(a \neq a'\). This implies that \(a'\) depends on a, a contradiction to our assumption that the nodes of \(\mathcal{C}\) represent sets of independent values. Therefore the only possibility to have a cycle on A is when \((g \circ f)(a) = a\) for all a in A; in other words, the only possible cycle on A is \(\iota_{A}\), where \(\iota_{A}\) is the identity function on A. □
Proof of Proposition 2
Observe first that, as \(ans_{Q}\) is a function with source B, the query \(Q^{\prime\prime}\) is well formed, that is, its grouping function g and its measuring function \(ans_{Q}\) have the same source (namely B), and op is an operation on the target of \(ans_{Q}\).
Let \(c \in C\). It follows from well-known properties of functions that:
if \(g^{-1}(c) = \{b_{1}, \ldots, b_{k}\}\) then \((g \circ f)^{-1}(c) = f^{-1}(b_{1}) \cup \ldots \cup f^{-1}(b_{k})\)
From our definition of answer, we have:
\(ans_{Q^{\prime }}(c) = red(m/(g \circ f)^{-1}(c), op)\)
As op is a distributive operation and the family \(\{f^{-1}(b_{1}), \ldots, f^{-1}(b_{k})\}\) is a partition of \((g \circ f)^{-1}(c)\), we have:
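Spelled out, the chain of equalities behind this step can be sketched as follows. The sketch assumes, consistently with the rest of the proof, that \(Q = (f, m, op)\), so that \(ans_{Q}(b) = red(m/f^{-1}(b), op)\) for every b in B, and that \(Q^{\prime\prime} = (g, ans_{Q}, op)\). The tuple notation \(\langle \ldots \rangle\) for the list of partial results is introduced here only for illustration.

\[
\begin{aligned}
red(m/(g \circ f)^{-1}(c), op)
  &= red(\langle red(m/f^{-1}(b_{1}), op), \ldots, red(m/f^{-1}(b_{k}), op) \rangle, op) \\
  &= red(\langle ans_{Q}(b_{1}), \ldots, ans_{Q}(b_{k}) \rangle, op) \\
  &= red(ans_{Q}/g^{-1}(c), op) \\
  &= ans_{Q^{\prime\prime}}(c)
\end{aligned}
\]

The first equality is where distributivity of op over the partition is used; the last two follow from the definition of answer applied to \(Q\) and \(Q^{\prime\prime}\).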
Therefore \(ans_{Q^{\prime}}(c) = ans_{Q^{\prime\prime}}(c)\) for all c in C, and this concludes the proof. □
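As a concrete check of the equality just proved, here is a minimal Python sketch. The data (invoices, branches, regions, quantities) and the helper `answer` are hypothetical and introduced only for illustration, and op is taken to be addition, which is distributive in the sense used above. It is not the evaluation scheme of the paper, only one small instance of it.

```python
from collections import defaultdict
from functools import reduce
import operator


def answer(grouping, measuring, op, items):
    """Evaluate a query (grouping, measuring, op) over `items`:
    group the items by grouping(x), then reduce the measured values with op."""
    groups = defaultdict(list)
    for x in items:
        groups[grouping(x)].append(measuring(x))
    return {key: reduce(op, values) for key, values in groups.items()}


# Hypothetical functions, stored as dictionaries:
# f : invoice -> branch, g : branch -> region, m : invoice -> quantity
branch_of = {1: "b1", 2: "b1", 3: "b2", 4: "b3", 5: "b3"}
region_of = {"b1": "north", "b2": "north", "b3": "south"}
quantity_of = {1: 10, 2: 5, 3: 7, 4: 2, 5: 8}

f = branch_of.get
g = region_of.get
m = quantity_of.get
invoices = list(branch_of)

# Q = (f, m, +): total quantity per branch
ans_q = answer(f, m, operator.add, invoices)

# Q' = (g o f, m, +): total quantity per region, computed from the invoices
ans_q1 = answer(lambda d: g(f(d)), m, operator.add, invoices)

# Q'' = (g, ans_Q, +): total quantity per region, computed from the answer of Q
ans_q2 = answer(g, ans_q.get, operator.add, list(ans_q))

print(ans_q1 == ans_q2)
```

In this instance the per-region totals obtained directly from the invoices coincide with those obtained by reusing the per-branch totals, which is the situation the proposition describes.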
Proof of Proposition 3.1
Suppose first that there is a function h : A → B such that \(g = h \circ f\). Then for all \(a, a'\) in A such that \(f(a) = f(a')\) we have: \(g(a) = h(f(a)) = h(f(a')) = g(a')\). Therefore \(f \leq g\).

Suppose next that \(f \leq g\). It follows that \(\pi_{f} \leq \pi_{g}\). Consider now any b in the range of f and define: \(h(b) = g(f^{-1}(b))\). As \(f \leq g\), the block \(f^{-1}(b)\) of \(\pi_{f}\) is included in some block of \(\pi_{g}\), say \(g^{-1}(c)\), where c is in the range of g. It follows that \(g(f^{-1}(b)) = c\), that is, g takes the single value c on \(f^{-1}(b)\); therefore h is a well-defined function over the range of f. Moreover, for any a in A with \(b = f(a)\), the definition of h gives: \(h(f(a)) = g(f^{-1}(b)) = g(a)\), since a belongs to \(f^{-1}(b)\). Therefore \(h \circ f = g\).

Finally, suppose there is a function h′ : A → B such that \(h' \neq h\) and \(h' \circ f = g = h \circ f\). Now, for any b in the range of f there is a in A such that \(f(a) = b\). Therefore \(h'(b) = h'(f(a)) = (h' \circ f)(a) = g(a) = (h \circ f)(a) = h(b)\). It follows that \(h'/range(f) = h/range(f)\), and this completes the proof. □
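The construction of h used in the proof can be illustrated with a short Python sketch. The function name `factor_through` and the store, city and country data are hypothetical, chosen only for illustration. The sketch builds h on the range of f when \(f \leq g\) holds and reports failure otherwise.

```python
def factor_through(f, g, domain):
    """Try to build h on range(f) such that g(a) == h[f(a)] for all a in domain.

    Returns h as a dictionary, or None when two elements with the same
    f-value have different g-values (that is, when f <= g does not hold).
    """
    h = {}
    for a in domain:
        b, c = f(a), g(a)
        # The first time b is seen, record c; afterwards, check consistency.
        if h.setdefault(b, c) != c:
            return None
    return h


# Hypothetical functions with common source (stores), stored as dictionaries.
city_of = {"s1": "Paris", "s2": "Paris", "s3": "Lyon", "s4": "Rome"}
country_of = {"s1": "FR", "s2": "FR", "s3": "FR", "s4": "IT"}

f = city_of.get
g = country_of.get
stores = list(city_of)

h = factor_through(f, g, stores)          # city -> country exists
no_h = factor_through(g, f, stores)       # country -> city does not

print(h)
print(no_h)
```

Here the partition of stores by city refines the partition by country, so h exists and is determined on every city in the range of f. In the opposite direction, two stores in the same country lie in different cities, so no such function can be built.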
Cite this article
Spyratos, N., Sugibuchi, T. HIFUN - a high level functional query language for big data analytics. J Intell Inf Syst 51, 529–555 (2018). https://doi.org/10.1007/s10844-018-0495-6
DOI: https://doi.org/10.1007/s10844-018-0495-6