DOI: 10.1145/1559845.1559895
research-article

Estimating the confidence of conditional functional dependencies

Published: 29 June 2009

Abstract

Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD) gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large.
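To make the two quantities concrete, here is a minimal Python sketch of the exact computation for a single pattern row. It rests on assumptions that are not stated in the abstract: confidence is taken to be the largest fraction of matching tuples that can be kept so that the dependency holds exactly (one common definition), the pattern constrains only left-hand-side attributes, and the names `WILDCARD`, `matches`, and `support_and_confidence` as well as the toy relation are hypothetical.

```python
from collections import Counter, defaultdict

WILDCARD = "_"  # hypothetical marker meaning "any value" in a pattern row


def matches(row, pattern):
    """True if `row` agrees with every constant in `pattern`."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def support_and_confidence(rows, lhs, rhs, pattern):
    """Exact support and confidence of the CFD (lhs -> rhs, pattern).

    Support: number of rows that match the pattern.
    Confidence (assumed definition): fraction of matching rows that remain
    after keeping, for each lhs value, only the rows that carry its most
    frequent rhs value.
    """
    groups = defaultdict(Counter)  # lhs value -> counts of rhs values
    support = 0
    for row in rows:
        if matches(row, pattern):
            support += 1
            key = tuple(row[a] for a in lhs)
            groups[key][row[rhs]] += 1
    if support == 0:
        return 0, 1.0  # no applicable rows: vacuously satisfied
    kept = sum(max(counts.values()) for counts in groups.values())
    return support, kept / support


# Toy relation (made-up data, for illustration only).
rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "UK", "zip": "W1B", "city": "London"},
    {"country": "US", "zip": "07974", "city": "Murray Hill"},
]

support, confidence = support_and_confidence(
    rows,
    lhs=["country", "zip"],
    rhs="city",
    pattern={"country": "UK", "zip": WILDCARD},
)
print(support, confidence)
```

Four rows match the pattern and three of them can be kept without violating the dependency. The `groups` dictionary holds one counter per distinct left-hand-side value, so its size grows with the data, which is one way to see why the exact computation becomes expensive on large relations.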
We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also in the harder case when it is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative error guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.
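The difference between the two kinds of guarantee can be written out as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper: $c \in [0, 1]$ is the true confidence, $\hat{c}$ is the estimate, and $\epsilon > 0$ is the error parameter.

```latex
% Additive error: the estimate is within \epsilon of the true confidence.
|\hat{c} - c| \le \epsilon

% Relative error: the estimate is within a factor (1 \pm \epsilon) of the true confidence.
|\hat{c} - c| \le \epsilon \, c
```

Because $c \le 1$, the relative bound $\epsilon c$ is never larger than $\epsilon$, so a relative guarantee would imply the additive one and is the stronger requirement, most noticeably when $c$ is small.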


Information & Contributors

Information

Published In

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN:9781605585512
DOI:10.1145/1559845
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

  1. conditional functional dependencies

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '09: International Conference on Management of Data
June 29 - July 2, 2009
Providence, Rhode Island, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 17 Feb 2025

Cited By

  • (2024) Measuring Approximate Functional Dependencies: A Comparative Study. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 3505-3518. DOI: 10.1109/ICDE60146.2024.00270. Online publication date: 13-May-2024
  • (2024) Mining CFD Algorithm on Medium Dataset Using Queue. 2024 IEEE 6th International Conference on Cybernetics, Cognition and Machine Learning Applications (ICCCMLA), pp. 90-95. DOI: 10.1109/ICCCMLA63077.2024.10871581. Online publication date: 19-Oct-2024
  • (2023) EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 2878-2891. DOI: 10.1109/ICDE55515.2023.00220. Online publication date: Apr-2023
  • (2023) Discovering Editing Rules by Deep Reinforcement Learning. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 355-367. DOI: 10.1109/ICDE55515.2023.00034. Online publication date: Apr-2023
  • (2023) Functional Dependencies with Predicates: What Makes the g3-error Easy to Compute? Graph-Based Representation and Reasoning, pp. 3-16. DOI: 10.1007/978-3-031-40960-8_1. Online publication date: 11-Sep-2023
  • (2022) Assessing the Existence of a Function in a Dataset with the g3 Indicator. 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 607-620. DOI: 10.1109/ICDE53745.2022.00050. Online publication date: May-2022
  • (2022) Foundations of Data Quality Management. DOI: 10.1007/978-3-031-01892-3. Online publication date: 2-Mar-2022
  • (2022) Data Profiling. Online publication date: 25-Feb-2022
  • (2021) Horizon. Proceedings of the VLDB Endowment, 14:11, pp. 2546-2554. DOI: 10.14778/3476249.3476301. Online publication date: 1-Jul-2021
  • (2019) Discovery of MicroDependencies. IEEE Access, 7, pp. 50198-50213. DOI: 10.1109/ACCESS.2019.2910843. Online publication date: 2019
