Abstract
As warehouse data volumes grow, single-node solutions can no longer analyze the data, so shared-nothing architectures such as MapReduce become necessary. Partitioning data across nodes in MapReduce, however, creates node connectivity issues, network congestion, poor use of node memory capacity and inefficient use of processing power. In addition, dimensions and measures cannot be changed without modifying previously stored data, and big dimensions are hard to manage. This paper proposes a method called Atrak, which uses a unified data format to make Mapper nodes independent of one another and thereby address these data management problems. The method applies to star schema data warehouse models with distributive measures. Atrak increases query execution speed through node independence and proper use of MapReduce. It was compared with established systems such as Hive, Spark-SQL, HadoopDB and Flink, and simulation results confirm its improved query execution speed. Data unification in MapReduce can also be applied in other fields, such as data mining and graph processing.
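The role of distributive measures and independent Mapper nodes can be illustrated with a minimal Python sketch. It is not the Atrak implementation: the tables, field names and functions (`unify`, `partition`, `map_partial`, `reduce_partials`, `query`) are hypothetical, and the "unified format" is assumed here to mean fact rows that already carry their dimension values. Under that assumption each node aggregates its own records without contacting any other node, and the partial results can be merged because SUM and COUNT are distributive.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table and two small dimension tables.
PRODUCTS = {1: "laptop", 2: "phone"}
STORES = {10: "north", 20: "south"}
SALES = [  # (product_id, store_id, amount)
    (1, 10, 900.0), (2, 10, 400.0), (1, 20, 1100.0),
    (2, 20, 350.0), (2, 10, 450.0),
]


def unify(sales, products, stores):
    """Denormalize each fact row so it carries its own dimension values
    (an assumed reading of 'unified data format')."""
    return [
        {"product": products[p], "store": stores[s], "amount": amount}
        for p, s, amount in sales
    ]


def partition(records, n_nodes):
    """Round-robin split of the unified records across n_nodes."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def map_partial(records, group_key):
    """Runs on one node: partial (sum, count) per group from local data only."""
    partial = defaultdict(lambda: [0.0, 0])
    for record in records:
        acc = partial[record[group_key]]
        acc[0] += record["amount"]
        acc[1] += 1
    return dict(partial)


def reduce_partials(partials):
    """Merge per-node results; valid because SUM and COUNT are distributive."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (amount_sum, count) in partial.items():
            total[key][0] += amount_sum
            total[key][1] += count
    return {key: (s, c) for key, (s, c) in total.items()}


def query(n_nodes, group_key):
    """Group-by aggregation over n_nodes simulated, independent mappers."""
    records = unify(SALES, PRODUCTS, STORES)
    parts = partition(records, n_nodes)
    return reduce_partials(map_partial(part, group_key) for part in parts)
```

In this sketch `query(1, "product")` and `query(3, "product")` return the same totals, since the merged result does not depend on how the records were split. A holistic measure such as the median could not be merged from per-node partial results in this way.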



Notes
Mohammad Hossein Barkhordari Query Language.
Cite this article
Barkhordari, M., Niamanesh, M. Atrak: a MapReduce-based data warehouse for big data. J Supercomput 73, 4596–4610 (2017). https://doi.org/10.1007/s11227-017-2037-3
DOI: https://doi.org/10.1007/s11227-017-2037-3