SAIR: significance-aware approach to improve QoR of big data processing in case of budget constraint

Ahmadvand, Hossein; Goudarzi, Maziar

doi:10.1007/s11227-019-02797-7

Published: 03 April 2019

Volume 75, pages 5760–5781 (2019)

The Journal of Supercomputing


Abstract

Enterprises in many domains now face big data processing workloads, including transaction operations, business calculations and analytical computations. Large-scale computing is the usual way to process such data. Because large-scale computing is costly and enterprise budgets are limited, it is often impossible to process all of the input data, and the Quality of Result (QoR) may suffer as a consequence. SAIR is an approach that improves the QoR of big data processing for aggregative usages under a budget constraint by exploiting the variety in significance among data portions. SAIR assigns the most significant data portions to the resources that are most efficient in terms of time and cost. If budget remains, the other data portions are assigned to the remaining resources. To identify the most and least significant data portions, the approach uses statistical methods and a sampling technique with a 95% confidence level and a 5% margin of error. With this method, users can improve QoR while respecting both the budget constraint and the preferred finishing time. The evaluation uses applications from different domains, including documents and text, transaction data and system logs. The results indicate that SAIR improves QoR while meeting the budget constraint for the considered usages, by up to 15% compared with the state of the art.
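The two ideas in the abstract, sampling at a 95% confidence level with a 5% margin of error and giving the most significant portions to the most efficient resources until the budget runs out, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which sample-size formula is used, so Cochran's formula for proportions (with finite-population correction) is an assumption here. The functions `cochran_sample_size` and `allocate`, the fields `significance`, `size_gb` and `cost_per_gb`, the one-portion-per-resource rule, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def cochran_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion (assumed formula, not from the paper).

    z=1.96 corresponds to a 95% confidence level, margin=0.05 to a 5% margin
    of error, and p=0.5 is the most conservative choice of proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite-population correction shrinks the sample for small populations.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def allocate(portions, resources, budget):
    """Greedy sketch: most significant portion goes to the cheapest free resource.

    Assumption: each resource processes at most one portion, and efficiency is
    reduced to a single hypothetical cost_per_gb figure.
    """
    by_significance = sorted(portions, key=lambda p: p["significance"], reverse=True)
    by_cost = sorted(resources, key=lambda r: r["cost_per_gb"])
    plan, remaining, next_resource = [], budget, 0
    for portion in by_significance:
        if next_resource >= len(by_cost):
            break  # no free resources left
        resource = by_cost[next_resource]
        cost = portion["size_gb"] * resource["cost_per_gb"]
        if cost <= remaining:
            plan.append((portion["name"], resource["name"], cost))
            remaining -= cost
            next_resource += 1
        # Otherwise the portion is skipped and stays unprocessed.
    return plan, remaining


# Hypothetical inputs: significance would come from the sampled estimate.
portions = [
    {"name": "A", "size_gb": 10, "significance": 0.2},
    {"name": "B", "size_gb": 10, "significance": 0.9},
    {"name": "C", "size_gb": 10, "significance": 0.5},
]
resources = [
    {"name": "r1", "cost_per_gb": 2.0},
    {"name": "r2", "cost_per_gb": 1.0},
    {"name": "r3", "cost_per_gb": 3.0},
]

sample_size = cochran_sample_size(10_000)
plan, remaining = allocate(portions, resources, budget=35.0)
print(sample_size)  # 370 records sampled out of 10,000
print(plan)         # [('B', 'r2', 10.0), ('C', 'r1', 20.0)]
print(remaining)    # 5.0
```

In this toy run the least significant portion (A) is left unprocessed because the remaining budget cannot cover it, which is the trade-off the abstract describes: QoR is protected by spending the budget on the data that matters most.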



Author information

Authors and Affiliations

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Hossein Ahmadvand & Maziar Goudarzi


Corresponding author

Correspondence to Hossein Ahmadvand.


About this article

Cite this article

Ahmadvand, H., Goudarzi, M. SAIR: significance-aware approach to improve QoR of big data processing in case of budget constraint. J Supercomput 75, 5760–5781 (2019). https://doi.org/10.1007/s11227-019-02797-7

Issue Date: September 2019
DOI: https://doi.org/10.1007/s11227-019-02797-7
