Abstract
Functional dependencies (FDs) are important metadata that describe relationships among the columns of a dataset and are used in tasks such as schema normalization and data cleansing. In modern big data environments, data are partitioned, so single-node FD discovery algorithms are inefficient without parallelization. However, existing parallel and distributed algorithms incur high communication costs, which limits their performance.
To solve this problem, we propose a general parallel discovery strategy, called GDS, that improves the parallel performance of single-node algorithms. GDS consists of two essential building blocks: the FD-Combine algorithm and the affine plane block design algorithm. The former infers the final FDs from part-FD sets, where a part-FD set is an FD set that holds over part of the original dataset. The latter generates data blocks such that the part-FD sets of those blocks satisfy the FD-Combine induction condition. With our strategy, any single-node FD discovery algorithm can be parallelized directly, without modification, in distributed environments. In the evaluation, with \(p\) threads, the speedup of the FD discovery algorithm FastFDs exceeds \(\sqrt{p}\) in most cases and even exceeds \(p/2\) in some cases. In distributed environments, HYFD, the best multi-threaded algorithm, also improves significantly with our strategy when the number of threads is large.
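The idea of combining per-block results can be illustrated with a minimal Python sketch. It is not the paper's FD-Combine or block design algorithm, whose details the abstract does not give. It assumes that the induction condition amounts to every pair of data partitions appearing together in at least one block, which the lines of an affine plane of prime order q guarantee. Under that assumption, an FD holds on the whole dataset exactly when it holds on every block, because any violation is witnessed by a single pair of rows. The names `affine_plane_blocks`, `holds`, and `holds_via_blocks` are hypothetical.

```python
from itertools import combinations


def affine_plane_blocks(q):
    """Lines of the affine plane of prime order q, as sets of point ids.

    Point (x, y) gets id x * q + y. There are q*q points and q*q + q lines,
    each with q points; every pair of points lies on exactly one line.
    Assumption: q is prime, so arithmetic mod q forms a field.
    """
    blocks = []
    for m in range(q):          # non-vertical lines y = m*x + b (mod q)
        for b in range(q):
            blocks.append(frozenset(x * q + (m * x + b) % q for x in range(q)))
    for c in range(q):          # vertical lines x = c
        blocks.append(frozenset(c * q + y for y in range(q)))
    return blocks


def holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds on rows (column indices)."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # The first row with this key fixes the expected rhs value.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def holds_via_blocks(partitions, blocks, lhs, rhs):
    """Check the FD on each block separately and combine with a logical AND.

    Valid only under the stated assumption that every pair of partitions
    shares at least one block.
    """
    return all(
        holds([r for p in sorted(block) for r in partitions[p]], lhs, rhs)
        for block in blocks
    )


# Toy data: nine single-row partitions, one per point of the order-3 plane.
rows = [(i % 3, (i % 3) * 10, i) for i in range(9)]
partitions = [[row] for row in rows]
blocks = affine_plane_blocks(3)

# Every pair of partitions co-occurs in exactly one block.
pair_counts = [
    sum(1 for blk in blocks if a in blk and b in blk)
    for a, b in combinations(range(9), 2)
]
```

In this toy setup each block touches only 3 of the 9 partitions, so each worker handles a fraction of the data while the pair-coverage property keeps the combined answer consistent with a check over the full dataset.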
Acknowledgments
This work was supported by the Anhui Initiative in Quantum Information Technologies (No. AHY150300).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, P., Yang, W., Wang, H., Huang, L. (2020). GDS: General Distributed Strategy for Functional Dependency Discovery Algorithms. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12112. Springer, Cham. https://doi.org/10.1007/978-3-030-59410-7_17
DOI: https://doi.org/10.1007/978-3-030-59410-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59409-1
Online ISBN: 978-3-030-59410-7
eBook Packages: Mathematics and Statistics (R0)