DOI: 10.1145/3269206.3271724
research-article

Improving the Efficiency of Inclusion Dependency Detection

Published: 17 October 2018

Abstract

The detection of all inclusion dependencies (INDs) in an unknown dataset is at the core of any data profiling effort. Apart from the discovery of foreign key relationships, INDs can help perform data integration, integrity checking, schema (re-)design, and query optimization. With the advent of Big Data, demand is increasing for efficient IND discovery algorithms that scale with the input data size. To this end, we propose S-indd++, a scalable system for detecting unary INDs in large datasets. S-indd++ applies a new stepwise partitioning technique that discards a large number of attributes in the early phases of detection by processing the first, smaller partitions. S-indd++ also extends the concept of attribute clustering to decide which attributes to discard based on the clustering result of each partition. Moreover, in contrast to the state of the art, S-indd++ does not require a partition to fit into main memory, which is a highly desirable property in the face of ever-growing datasets. We conducted an exhaustive evaluation of S-indd++ by applying it to large datasets with thousands of attributes and more than 266 million tuples. The results show that S-indd++ clearly outperforms the state of the art: it reduced the runtime by up to 50% in comparison with Binder, and by up to 98% in comparison with S-indd.
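
To make the terms in the abstract concrete, the following is a minimal in-memory sketch of unary IND discovery by attribute clustering over hash partitions. It is not the S-indd++ implementation: the function name `unary_inds`, the `num_partitions` parameter, and the hash-based partitioning are assumptions made for illustration, and the sketch keeps every partition in memory, which S-indd++ explicitly does not require.

```python
from collections import defaultdict


def unary_inds(columns, num_partitions=4):
    """Return all unary INDs (A, B), meaning values(A) is a subset of values(B).

    `columns` maps an attribute name to an iterable of its values.
    Illustrative sketch only; names and partitioning scheme are assumptions.
    """
    attrs = list(columns)
    # For each dependent attribute A, the attributes it may still be included in.
    candidates = {a: set(attrs) - {a} for a in attrs}

    for p in range(num_partitions):
        # Attribute clustering for partition p: value -> attributes containing it.
        clusters = defaultdict(set)
        for attr, values in columns.items():
            for v in values:
                if v is not None and hash(v) % num_partitions == p:
                    clusters[v].add(attr)

        # A can only be included in attributes that share each of its values,
        # so intersect A's candidates with every cluster that contains A.
        for cluster in clusters.values():
            for attr in cluster:
                candidates[attr] &= cluster

        # Stop early once no attribute has a remaining candidate.
        if not any(candidates.values()):
            break

    return sorted((a, b) for a, refs in candidates.items() for b in refs)


if __name__ == "__main__":
    tables = {
        "orders.customer_id": [1, 2, 2, 3],
        "customers.id": [1, 2, 3, 4],
        "customers.age": [30, 41, 30, 25],
    }
    print(unary_inds(tables))
```

In this toy input, every value of `orders.customer_id` also occurs in `customers.id`, so that pair is reported as an IND, which is the pattern a foreign key candidate would follow. Attributes whose candidate set becomes empty after an early partition no longer need to be checked as dependents, which is the intuition behind discarding attributes early.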


    Published In

    CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
    October 2018
    2362 pages
    ISBN:9781450360142
    DOI:10.1145/3269206
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. algorithms
    2. data mining
    3. data partitioning
    4. data profiling

    Qualifiers

    • Research-article

    Conference

    CIKM '18
    Acceptance Rates

    CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 9
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2023) Fast Discovery of Inclusion Dependencies with Desbordante. 2023 33rd Conference of Open Innovations Association (FRUCT), 264-275. DOI: 10.23919/FRUCT58615.2023.10143047. Online publication date: 24-May-2023
    • (2023) Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. Proceedings of the VLDB Endowment 16:10, 2578-2590. DOI: 10.14778/3603581.3603596. Online publication date: 1-Jun-2023
    • (2022) New Trends in Big Data Profiling. Intelligent Computing, 808-825. DOI: 10.1007/978-3-031-10461-9_55. Online publication date: 7-Jul-2022
    • (2019) Inclusion Dependency Discovery. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 219-228. DOI: 10.1145/3357384.3357916. Online publication date: 3-Nov-2019
