ABSTRACT
Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None of them, however, can process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. Operating on compact data structures, HyFD not only outperforms all existing approaches but also scales to much larger datasets.
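To make the terms "validation" and "minimal functional dependency" concrete, the following is a minimal sketch of a brute-force baseline. It is not HyFD and does not reflect the approximation or data structures mentioned in the abstract; it enumerates every candidate left-hand side, so its cost grows exponentially with the number of attributes. The function names `holds` and `minimal_fds` and the sample rows are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Validate one candidate FD lhs -> rhs.

    The FD holds if any two rows that agree on all lhs columns
    also agree on the rhs column.
    """
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # Remember the first rhs value seen for this lhs value combination.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, num_attrs):
    """Naive enumeration of all minimal, non-trivial FDs (illustrative only)."""
    found = []
    for rhs in range(num_attrs):
        others = [a for a in range(num_attrs) if a != rhs]
        minimal_lhs = []
        # Visit candidates from small to large so minimal ones are found first.
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an lhs that already determines rhs.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    found.append((lhs, rhs))
    return found


# Hypothetical sample data; columns are 0 = id, 1 = zip, 2 = city.
rows = [
    (1, "10115", "Berlin"),
    (2, "10115", "Berlin"),
    (3, "14482", "Potsdam"),
    (4, "14467", "Potsdam"),
]
fds = minimal_fds(rows, 3)
print(fds)
```

On this sample, the sketch reports that `id` determines both `zip` and `city`, and that `zip` determines `city`, while `city` does not determine `zip`. The exhaustive candidate enumeration is exactly the cost that makes discovery on wide tables hard.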
Index Terms
- A Hybrid Approach to Functional Dependency Discovery