FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark

Cheng, Feng; Yang, Zhe

doi:10.1007/s11227-018-2643-8

Published: 15 October 2018

Volume 75, pages 2497–2517, (2019)

The Journal of Supercomputing

413 Accesses
5 Citations

Abstract

Minimal functional dependencies (MFDs) are an important kind of relationship in relational databases: they can describe special relationships between complex and irregular attributes. Extracting MFDs from relational databases is an important database analysis technique. However, even the most efficient stand-alone algorithms are exponential in the number of attributes of the relation, so as data grows in size, discovering MFDs on a single computer becomes hard and slow and is practical only for small centralized datasets. Discovering MFDs from big data, especially large-scale distributed data, is therefore challenging. Apache Spark is a unified analytics engine for big data processing. We present FastMFDs, a new Spark-based algorithm that discovers all MFDs from large-scale distributed data in parallel. FastMFDs uses both the RDD and DataFrame frameworks to store and process distributed data, deletes equivalent attributes, and provides a two-way search algorithm for searching and pruning. We evaluated our algorithm on real-life datasets; it is more efficient and faster than existing discovery methods.
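
To make the terms in the abstract concrete, the following is a minimal single-machine sketch in plain Python. It is not the FastMFDs algorithm and does not use Spark; it only illustrates what a minimal functional dependency and a pair of equivalent attributes are, using an exhaustive search that is exponential in the number of attributes. The function names (`holds`, `minimal_fds`, `equivalent_pairs`) and the toy relation are assumptions made for illustration, and dependencies with an empty left-hand side are deliberately ignored.

```python
from itertools import combinations

def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def minimal_fds(rows, attributes):
    """Brute-force every minimal, non-trivial FD (illustrative only)."""
    found = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        minimal_lhs = []
        # Try smaller left-hand sides first so supersets can be pruned.
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                # A superset of a valid left-hand side is never minimal.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
        found.extend((lhs, rhs) for lhs in minimal_lhs)
    return found

def equivalent_pairs(rows, attributes):
    """Attribute pairs that determine each other (a -> b and b -> a)."""
    return [(a, b) for a, b in combinations(attributes, 2)
            if holds(rows, (a,), b) and holds(rows, (b,), a)]

# Hypothetical toy relation, not taken from the article's datasets.
attributes = ["id", "zip", "city", "code"]
rows = [
    {"id": 1, "zip": "215006", "city": "Suzhou", "code": "SZ"},
    {"id": 2, "zip": "215006", "city": "Suzhou", "code": "SZ"},
    {"id": 3, "zip": "100000", "city": "Beijing", "code": "BJ"},
    {"id": 4, "zip": "100001", "city": "Beijing", "code": "BJ"},
]

for lhs, rhs in minimal_fds(rows, attributes):
    print(", ".join(lhs), "->", rhs)
print("equivalent:", equivalent_pairs(rows, attributes))
```

In this toy relation `city` and `code` determine each other, so one of them could be dropped before the search and restored afterwards, which is the intuition behind deleting equivalent attributes. The exhaustive loop also shows why the single-machine cost grows exponentially with the number of attributes.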

Acknowledgements

This study was funded by NSFC (Grant No. 61373164).

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, Jiangsu, China
Feng Cheng & Zhe Yang

Corresponding author

Correspondence to Feng Cheng.

About this article

Cite this article

Cheng, F., Yang, Z. FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark. J Supercomput 75, 2497–2517 (2019). https://doi.org/10.1007/s11227-018-2643-8

Published: 15 October 2018
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s11227-018-2643-8
