Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments

Nascimento, Dimas Cassimiro; Pires, Carlos Eduardo; Mestre, Demetrio Gomes

doi:10.1007/s10489-016-0774-2

Published: 02 April 2016

Volume 45, pages 530–548, (2016)

Applied Intelligence

Dimas Cassimiro Nascimento¹,², Carlos Eduardo Pires¹ & Demetrio Gomes Mestre¹

Abstract

Deduplication is the task of identifying the entities in a data set that refer to the same real-world object. Over the last decades, this problem has been extensively investigated, and many techniques have been proposed to improve the efficiency and effectiveness of deduplication algorithms. As data sets become larger, such algorithms may create critical bottlenecks in memory usage and execution time. In this context, cloud computing environments have been used for scaling out data quality algorithms. In this paper, we investigate the efficacy of different machine learning techniques for scaling out virtual clusters for the execution of deduplication algorithms under predefined time restrictions. We also propose specific heuristics (Best Performing Allocation, Probabilistic Best Performing Allocation, Tunable Allocation, Adaptive Allocation and Sliced Training Data) which, together with the machine learning techniques, are able to tune the virtual cluster estimations as demands fluctuate over time. The experiments we have carried out using multiple-scale data sets have provided many insights regarding the adequacy of the considered machine learning algorithms and proposed heuristics for tackling cloud computing provisioning.
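To make the provisioning problem concrete, the following is a minimal sketch of how a learned run-time estimate could drive the choice of virtual cluster size under a time restriction. It is not the authors' method and does not implement any of the named heuristics. The history table, the nearest-neighbor estimator, the distance scaling, and the names `HISTORY`, `predict_minutes`, and `smallest_cluster` are all assumptions introduced for illustration.

```python
from math import hypot

# Hypothetical history of past deduplication runs:
# (records in millions, number of virtual machines, observed minutes).
# These values are invented for illustration only.
HISTORY = [
    (1.0, 2, 60.0), (1.0, 4, 32.0), (1.0, 8, 18.0),
    (2.0, 2, 118.0), (2.0, 4, 61.0), (2.0, 8, 33.0),
    (4.0, 4, 121.0), (4.0, 8, 63.0), (4.0, 16, 34.0),
]


def predict_minutes(size, vms, k=1):
    """Estimate run time as the mean over the k most similar past runs."""
    # Assumption: VM counts are divided by 4 so that both features
    # contribute on a comparable scale to the distance.
    ranked = sorted(
        HISTORY,
        key=lambda run: hypot(run[0] - size, (run[1] - vms) / 4.0),
    )
    nearest = ranked[:k]
    return sum(run[2] for run in nearest) / len(nearest)


def smallest_cluster(size, deadline, candidates=(2, 4, 8, 16)):
    """Return the smallest candidate cluster predicted to meet the deadline."""
    for vms in sorted(candidates):
        if predict_minutes(size, vms) <= deadline:
            return vms
    return None  # no candidate is predicted to finish in time


if __name__ == "__main__":
    print(smallest_cluster(2.0, 40.0))
```

Picking the smallest cluster that satisfies the predicted deadline keeps the allocation as cheap as the estimate allows; any regression model could stand in for the nearest-neighbor estimator used here.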

Author information

Authors and Affiliations

Department of Computer Science, Federal University of Campina Grande, Campina Grande, Brazil
Dimas Cassimiro Nascimento, Carlos Eduardo Pires & Demetrio Gomes Mestre
Federal Rural University of Pernambuco, Recife, Brazil
Dimas Cassimiro Nascimento

Corresponding author

Correspondence to Dimas Cassimiro Nascimento.

Appendix A: Adopted notation

In Table 10, we summarize the main notations adopted throughout the paper.

Table 10 Adopted notation

About this article

Cite this article

Nascimento, D.C., Pires, C.E. & Mestre, D.G. Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments. Appl Intell 45, 530–548 (2016). https://doi.org/10.1007/s10489-016-0774-2

Issue Date: September 2016
