Incrementally updating unary inclusion dependencies in dynamic data

Shaabani, Nuhad; Meinel, Christoph

doi:10.1007/s10619-018-7233-5

Published: 01 August 2018

Volume 37, pages 133–176 (2019)

Distributed and Parallel Databases

Nuhad Shaabani¹ &
Christoph Meinel¹

Abstract

Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications like data profiling, data cleaning, entity resolution, and schema matching. Their discovery in an unknown dataset is at the core of any data-analysis effort. Therefore, several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are appropriate for application on dynamic datasets. In these cases, discovery techniques should be able to efficiently update the inclusion dependencies after an update in the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating the unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset. An update of the clusters does not need access to the dataset because of special data structures designed to efficiently support the updating process. We performed an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116.2 million tuples. The results showed that the incremental discovery significantly reduces the runtime needed by the static discovery. This reduction in the runtime is up to 99.9996% for both the insertion and the deletion.
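
The abstract does not spell out the data structures, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that an attribute cluster is the set of attributes sharing a given value, and that a unary inclusion dependency A ⊆ B holds exactly when B appears in every cluster that contains A. The class name `UnaryIndIndex`, its methods, and the per-value occurrence counters are hypothetical choices made for illustration.

```python
from collections import defaultdict


class UnaryIndIndex:
    """Hypothetical index: tracks, for every value, which attributes contain it."""

    def __init__(self):
        # value -> attribute -> number of occurrences of that value in the attribute.
        # The counters are an assumption that lets deletions be handled
        # without rereading the dataset.
        self.counts = defaultdict(lambda: defaultdict(int))

    def insert(self, attribute, value):
        """Record that `value` was inserted into `attribute`."""
        self.counts[value][attribute] += 1

    def delete(self, attribute, value):
        """Record that one occurrence of `value` was deleted from `attribute`."""
        per_attr = self.counts.get(value)
        if per_attr is None or attribute not in per_attr:
            return  # nothing recorded for this pair; ignore the deletion
        per_attr[attribute] -= 1
        if per_attr[attribute] == 0:
            del per_attr[attribute]
        if not per_attr:
            del self.counts[value]

    def clusters(self):
        """Return the distinct attribute sets that share at least one value."""
        return {frozenset(per_attr) for per_attr in self.counts.values()}

    def unary_inds(self):
        """Derive all pairs (A, B) such that A ⊆ B, for non-empty attributes only."""
        refs = {}
        for cluster in self.clusters():
            for a in cluster:
                others = cluster - {a}
                # Intersect over every cluster in which `a` occurs.
                refs[a] = others if a not in refs else refs[a] & others
        return {(a, b) for a, bs in refs.items() for b in bs}
```

With attribute A holding the values 1 and 2 and attribute B holding 1, 2, and 3, `unary_inds()` yields only the pair `("A", "B")`. Inserting the value 4 into A invalidates that dependency, and deleting it again restores it, in both cases without rescanning the stored values of A or B.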

Author information

Authors and Affiliations

Hasso-Plattner-Institut, Prof.-Dr.-Helmert-Str. 2-3, 14482, Potsdam, Germany
Nuhad Shaabani & Christoph Meinel

Corresponding author

Correspondence to Nuhad Shaabani.

About this article

Cite this article

Shaabani, N., Meinel, C. Incrementally updating unary inclusion dependencies in dynamic data. Distrib Parallel Databases 37, 133–176 (2019). https://doi.org/10.1007/s10619-018-7233-5

Issue Date: 15 March 2019

Part of a collection:

Special Issue on Scientific and Statistical Data Management