Abstract
Complex expressions are the basis of data analytics. To process complex expressions on big data efficiently, we developed a novel optimization strategy for parallel computation platforms such as Hadoop and Spark. The strategy minimizes the number of data-repartition rounds to achieve high performance. To this end, we model the expression as a graph and develop a simplification algorithm for it. Based on this graph, we convert the round-minimization problem into a graph decomposition problem and develop a linear-time algorithm to solve it. We also design an appropriate implementation of the optimization strategy. Extensive experimental results demonstrate that the proposed approach optimizes the computation of complex expressions effectively at low cost.
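To make the round-minimization idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual algorithm): an expression plan is a sequence of operators, each annotated with the partition key it requires, and a new repartition round is needed only when the required key changes. The `Op` class, `count_rounds` function, and the toy plan are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Op:
    """One operator in a linearized expression plan."""
    name: str
    key: str  # partition key this operator requires

def count_rounds(plan):
    """Count repartition rounds in a linear operator chain:
    a new round starts whenever the required key changes."""
    rounds, current_key = 0, None
    for op in plan:
        if op.key != current_key:
            rounds += 1
            current_key = op.key
    return rounds

# Toy expression: two aggregations keyed by `user`, then a join keyed by `item`.
plan = [Op("filter", "user"), Op("sum", "user"),
        Op("join", "item"), Op("avg", "item")]
print(count_rounds(plan))  # 2 rounds: one per maximal run of a shared key
```

Grouping adjacent operators that share a partition key into one round is the intuition behind decomposing the expression graph; the paper's contribution is doing this decomposition optimally on a general graph rather than a linear chain.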
Acknowledgements
This paper was supported by NSFC grant U1866602.
Cite this article
Song, Y., Jin, H., Wang, H. et al. IDCOS: optimization strategy for parallel complex expression computation on big data. J Supercomput 77, 10334–10356 (2021). https://doi.org/10.1007/s11227-021-03674-y