Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments

Applied Intelligence

Abstract

Deduplication is the task of identifying the entities in a data set that refer to the same real-world object. Over the last decades, this problem has been extensively investigated, and many techniques have been proposed to improve the efficiency and effectiveness of deduplication algorithms. As data sets grow larger, such algorithms can become critical bottlenecks in terms of memory usage and execution time. In this context, cloud computing environments have been used to scale out data quality algorithms. In this paper, we investigate the efficacy of different machine learning techniques for scaling out virtual clusters that execute deduplication algorithms under predefined time restrictions. We also propose specific heuristics (Best Performing Allocation, Probabilistic Best Performing Allocation, Tunable Allocation, Adaptive Allocation and Sliced Training Data) which, together with the machine learning techniques, tune the virtual cluster estimations as demands fluctuate over time. Our experiments on data sets of multiple scales provide insights into the adequacy of the considered machine learning algorithms and the proposed heuristics for tackling cloud computing provisioning.
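To make the provisioning idea concrete, the following is a minimal sketch of how a learned model could pick a virtual cluster size under a time restriction. It is not the authors' method and does not reproduce any of the heuristics named above: it assumes a nearest-neighbour predictor as a stand-in for the machine learning technique, and the names `HISTORY`, `predict_time` and `smallest_cluster`, as well as all numbers, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical history of past runs:
# (records in the data set, virtual machines used, execution time in seconds).
# These values are invented for illustration; they are not results from the paper.
Run = Tuple[int, int, float]

HISTORY: List[Run] = [
    (100_000, 1, 400.0),
    (100_000, 2, 220.0),
    (100_000, 4, 130.0),
    (200_000, 1, 1500.0),
    (200_000, 2, 800.0),
    (200_000, 4, 450.0),
]


def predict_time(history: Sequence[Run], records: int, machines: int, k: int = 1) -> float:
    """Predict execution time as the mean time of the k most similar past runs."""
    # Scale both features by their maximum so neither dominates the distance.
    max_records = max(r for r, _, _ in history)
    max_machines = max(m for _, m, _ in history)

    def distance(run: Run) -> float:
        r, m, _ = run
        return ((r - records) / max_records) ** 2 + ((m - machines) / max_machines) ** 2

    nearest = sorted(history, key=distance)[:k]
    return sum(t for _, _, t in nearest) / len(nearest)


def smallest_cluster(
    history: Sequence[Run],
    records: int,
    deadline: float,
    candidates: Sequence[int],
) -> Optional[int]:
    """Return the smallest candidate cluster size predicted to meet the deadline."""
    for machines in sorted(candidates):
        if predict_time(history, records, machines) <= deadline:
            return machines
    return None  # No candidate is predicted to finish in time.
```

For example, `smallest_cluster(HISTORY, 200_000, 900.0, [1, 2, 4])` tries one machine first, rejects it because the predicted time exceeds the deadline, and settles on two. Trying the candidates in ascending order reflects one plausible goal, allocating as few machines as the time restriction allows; the paper's heuristics may balance this differently.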

Author information

Corresponding author

Correspondence to Dimas Cassimiro Nascimento.

Appendix A: Adopted notation
In Table 10, we summarize the main notations adopted throughout the paper.

Table 10 Adopted notation

Cite this article

Nascimento, D.C., Pires, C.E. & Mestre, D.G. Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments. Appl Intell 45, 530–548 (2016). https://doi.org/10.1007/s10489-016-0774-2

  • DOI: https://doi.org/10.1007/s10489-016-0774-2
