Abstract
Minimal functional dependencies (MFDs) are an important class of relationships in relational databases: they can describe special relationships between attributes that are otherwise complex and irregular. Extracting MFDs from relational databases is therefore an important database analysis technique. However, as data volumes grow, even the most efficient stand-alone algorithms take time exponential in the number of attributes of a relation, so discovery on a single computer is slow and practical only for small, centralized datasets. Discovering MFDs from big data, especially large-scale distributed data, remains challenging. We present FastMFDs, a new algorithm built on Apache Spark, a unified analytics engine for big data processing, that discovers all MFDs from large-scale distributed data in parallel. FastMFDs uses both the RDD and DataFrame frameworks to store and process distributed data, deletes equivalent attributes, and provides a two-way search algorithm for searching and pruning. In experiments on real-life datasets, FastMFDs was more efficient and faster than existing discovery methods.
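To make the notion of a minimal functional dependency concrete, the following is a minimal single-machine sketch in plain Python. It is not the FastMFDs algorithm and does not use Spark; it is a brute-force, level-wise enumeration whose cost is exponential in the number of attributes, which is exactly the cost the abstract refers to. The function names `holds` and `minimal_fds` and the sample relation are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, attributes):
    """Enumerate all minimal, non-trivial FDs by brute force (exponential)."""
    result = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        found = []  # minimal left-hand sides already found for this rhs
        # Visit candidate left-hand sides from smallest to largest.
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                # A superset of a known left-hand side cannot be minimal.
                if any(set(m) <= set(lhs) for m in found):
                    continue
                if holds(rows, lhs, rhs):
                    found.append(lhs)
        result.extend((lhs, rhs) for lhs in found)
    return result


# Hypothetical sample relation: id is unique, and zip determines city.
rows = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "New York"},
    {"id": 3, "zip": "94105", "city": "San Francisco"},
    {"id": 4, "zip": "94107", "city": "San Francisco"},
]
attributes = ["id", "zip", "city"]

for lhs, rhs in minimal_fds(rows, attributes):
    print(", ".join(lhs), "->", rhs)
```

On this sample relation the sketch reports `id -> zip`, `id -> city`, and `zip -> city`. The dependency `{id, zip} -> city` also holds but is not minimal, because a proper subset of its left-hand side already determines `city`.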
Acknowledgements
This study was funded by NSFC (Grant No. 61373164).
Cite this article
Cheng, F., Yang, Z. FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark. J Supercomput 75, 2497–2517 (2019). https://doi.org/10.1007/s11227-018-2643-8
DOI: https://doi.org/10.1007/s11227-018-2643-8