DOI: 10.1145/2882903.2915203
research-article

A Hybrid Approach to Functional Dependency Discovery

Published: 14 June 2016

ABSTRACT

Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
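
To make the notion of a minimal functional dependency concrete, the following is a small brute-force sketch in Python. It is not HyFD and uses none of its approximation or validation techniques; it simply checks every candidate left-hand side, which is only feasible for tiny tables. The function names `holds` and `minimal_fds` and the sample records are assumptions made for this illustration.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    The dependency holds when any two records that agree on all
    attributes in lhs also agree on the attribute rhs.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Remember the first rhs value seen for this key; any later
        # record with the same key must carry the same rhs value.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, attributes):
    """Naively enumerate all minimal, non-trivial FDs (exponential cost)."""
    found = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        minimal_lhs = []
        # Visit candidates by increasing size so that every valid
        # left-hand side found first is minimal.
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an already valid left-hand side.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    found.append((lhs, rhs))
    return found


# Hypothetical sample records, invented for this example.
rows = [
    {"zip": "10115", "city": "Berlin", "name": "Ada"},
    {"zip": "10115", "city": "Berlin", "name": "Bob"},
    {"zip": "14482", "city": "Potsdam", "name": "Ada"},
]
fds = minimal_fds(rows, ["zip", "city", "name"])
```

On this sample, `fds` contains `zip -> city` and `city -> zip`, and no dependency determines `name`. Because the number of candidate left-hand sides grows exponentially with the number of attributes, exhaustive checking of this kind does not reach the dataset sizes the abstract mentions.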

      • Published in

        SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
        June 2016
        2300 pages
        ISBN: 9781450335317
        DOI: 10.1145/2882903

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
