Incrementally updating unary inclusion dependencies in dynamic data

Distributed and Parallel Databases

Abstract

Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution, and schema matching. Their discovery in an unknown dataset is at the core of any data-analysis effort, and several research approaches have therefore focused on discovering them efficiently in a given, static dataset. None of these approaches, however, is suited to dynamic datasets, where a discovery technique should update the inclusion dependencies efficiently after each change to the dataset without reprocessing the entire dataset. We present the first approach for incrementally updating unary inclusion dependencies. Our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies can be derived efficiently. We incrementally update the clusters after each update of the dataset. Updating the clusters requires no access to the dataset, because special data structures are designed to support the updating process efficiently. We evaluated our approach extensively by applying it to large datasets with several hundred attributes and more than 116.2 million tuples. The results show that incremental discovery significantly reduces the runtime compared with static discovery: by up to 99.9996% for both insertions and deletions.
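
To make the idea of attribute clustering concrete, the following is a minimal sketch and not the data structures used in the article. It assumes that a cluster is the set of attributes sharing a given value, and that the unary inclusion dependency A ⊆ B holds exactly when B appears in every cluster that contains A. The class name `UnaryIndIndex`, its methods, and the per-value occurrence counters are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


class UnaryIndIndex:
    """Illustrative sketch only; all names here are assumptions."""

    def __init__(self, attributes):
        self.attributes = set(attributes)
        # value -> Counter mapping each attribute to how often it holds the value
        self.counts = defaultdict(Counter)

    def insert(self, row):
        """Register an inserted tuple, given as {attribute: value}."""
        for attr, value in row.items():
            if value is not None:  # NULLs are ignored in this sketch
                self.counts[value][attr] += 1

    def delete(self, row):
        """Unregister a tuple; assumes the tuple was inserted before."""
        for attr, value in row.items():
            if value is None:
                continue
            counter = self.counts[value]
            counter[attr] -= 1
            if counter[attr] == 0:
                del counter[attr]       # attribute no longer holds the value
            if not counter:
                del self.counts[value]  # value vanished from the dataset

    def clusters(self):
        """Distinct attribute sets that share at least one value."""
        return {frozenset(counter) for counter in self.counts.values()}

    def inds(self):
        """Derive all unary INDs (lhs, rhs) from the clusters alone."""
        clusters = self.clusters()
        result = set()
        for attr in self.attributes:
            containing = [c for c in clusters if attr in c]
            if not containing:
                continue  # attribute without values: skipped in this sketch
            rhs = frozenset.intersection(*containing) - {attr}
            result |= {(attr, other) for other in rhs}
        return result


index = UnaryIndIndex(["A", "B"])
index.insert({"A": 1, "B": 1})
index.insert({"A": 1, "B": 2})
print(index.inds())                 # A is included in B
index.insert({"A": 3, "B": 1})      # value 3 occurs only in A
print(index.inds())                 # the dependency no longer holds
index.delete({"A": 3, "B": 1})
print(index.inds())                 # the dependency holds again
```

Insertions and deletions touch only the counters of the values in the changed tuple, so the stored tuples are never rescanned. This sketch still rederives the dependencies from all clusters on each call to `inds()`, which a fully incremental method would avoid.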



Author information

Corresponding author

Correspondence to Nuhad Shaabani.

Cite this article

Shaabani, N., Meinel, C. Incrementally updating unary inclusion dependencies in dynamic data. Distrib Parallel Databases 37, 133–176 (2019). https://doi.org/10.1007/s10619-018-7233-5

  • DOI: https://doi.org/10.1007/s10619-018-7233-5
