FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark

The Journal of Supercomputing

Abstract

A minimal functional dependency (MFD) is an important relationship in a relational database. It can describe special relationships among complex and irregular attributes. Extracting MFDs from relational databases is an important database analysis technique. However, even the most efficient stand-alone algorithms are exponential in the number of attributes of the relation, so as data grows, discovering MFDs on a single computer becomes slow and is practical only for small, centralized datasets. Discovering MFDs from big data, especially large-scale distributed data, is therefore challenging. Apache Spark is a unified analytics engine for big data processing. Building on Spark, we present a new algorithm, FastMFDs, that discovers all MFDs from large-scale distributed data in parallel. FastMFDs uses both the RDD framework and the DataFrame framework to store and process distributed data, deletes equivalent attributes, and provides a two-way search algorithm for searching and pruning. We evaluated FastMFDs on real-life datasets, and it was faster and more efficient than existing discovery methods.
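To make the terms in the abstract concrete, here is a minimal single-machine sketch in Python. It is not the FastMFDs algorithm and does not use Spark. It assumes the standard definitions: a functional dependency X -> A holds when any two rows that agree on X also agree on A, and it is minimal when no proper subset of X also determines A. The relation `rows`, its attribute names, and the helpers `holds` and `minimal_fds` are hypothetical names introduced for illustration only, and the sketch considers non-empty left-hand sides only.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}  # maps each lhs value combination to the rhs value first seen with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False  # same lhs values, different rhs value
    return True


def minimal_fds(rows, attributes):
    """Naively enumerate minimal FDs with a non-empty left-hand side.

    Candidates are visited in order of increasing size, so any candidate
    that contains an already-found left-hand side cannot be minimal.
    """
    result = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        found = []  # minimal left-hand sides found so far for this rhs
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                if any(set(f) <= set(lhs) for f in found):
                    continue  # a subset already determines rhs: not minimal
                if holds(rows, lhs, rhs):
                    found.append(lhs)
        result.extend((lhs, rhs) for lhs in found)
    return result


# Hypothetical toy relation (illustrative data, not from the paper).
rows = [
    {"emp": 1, "dept": "D1", "mgr": "Ann"},
    {"emp": 2, "dept": "D1", "mgr": "Ann"},
    {"emp": 3, "dept": "D2", "mgr": "Bob"},
]

for lhs, rhs in minimal_fds(rows, ["emp", "dept", "mgr"]):
    print(", ".join(lhs), "->", rhs)
```

In this toy relation, `dept` and `mgr` determine each other, which is one common reading of "equivalent attributes"; whether FastMFDs uses exactly this notion is an assumption here. The enumeration visits up to 2^(n-1) candidate left-hand sides per attribute, which illustrates the exponential cost in the number of attributes that the abstract cites as the motivation for a parallel approach.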

Acknowledgements

This study was funded by NSFC (Grant No. 61373164).

Author information

Corresponding author

Correspondence to Feng Cheng.

Cite this article

Cheng, F., Yang, Z. FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark. J Supercomput 75, 2497–2517 (2019). https://doi.org/10.1007/s11227-018-2643-8

  • DOI: https://doi.org/10.1007/s11227-018-2643-8