Abstract
Deduplication is the task of identifying the entities in a data set that refer to the same real-world object. Over the last decades, this problem has been extensively investigated, and many techniques have been proposed to improve the efficiency and effectiveness of deduplication algorithms. As data sets become larger, such algorithms may create critical bottlenecks in memory usage and execution time. In this context, cloud computing environments have been used for scaling out data quality algorithms. In this paper, we investigate the efficacy of different machine learning techniques for scaling out virtual clusters for the execution of deduplication algorithms under predefined time restrictions. We also propose specific heuristics (Best Performing Allocation, Probabilistic Best Performing Allocation, Tunable Allocation, Adaptive Allocation and Sliced Training Data) which, together with the machine learning techniques, are able to tune the virtual cluster estimations as demands fluctuate over time. The experiments we have carried out using data sets of multiple scales have provided many insights regarding the adequacy of the considered machine learning algorithms and the proposed heuristics for tackling cloud computing provisioning.
Appendix A: Adopted notation
In Table 10, we summarize the main notations adopted throughout the paper.
Cite this article
Nascimento, D.C., Pires, C.E. & Mestre, D.G. Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments. Appl Intell 45, 530–548 (2016). https://doi.org/10.1007/s10489-016-0774-2
DOI: https://doi.org/10.1007/s10489-016-0774-2