Abstract
Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution, and schema matching. Discovering them in an unknown dataset is at the core of any data-analysis effort, and several research approaches have therefore focused on their efficient discovery in a given, static dataset. However, none of these approaches is suitable for dynamic datasets. In that setting, a discovery technique should be able to efficiently update the inclusion dependencies after the dataset changes, without reprocessing the entire dataset. We present the first approach for incrementally updating unary inclusion dependencies. Our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies can be derived efficiently. We incrementally update the clusters after each update of the dataset. Updating the clusters does not require access to the dataset, because special data structures are designed to efficiently support the updating process. We performed an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116.2 million tuples. The results showed that incremental discovery requires significantly less runtime than static discovery. The reduction in runtime is up to 99.9996% for both insertions and deletions.
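The idea of deriving unary inclusion dependencies from attribute clusters can be illustrated with a minimal sketch. It is not the data structure or algorithm of the paper, which the abstract does not specify; the class `AttributeClusterIndex`, its methods, and the attribute names are hypothetical. The sketch assumes hashable values, treats `None` as a null that is ignored, and assumes only previously inserted rows are deleted. It keeps, for every value, a count of occurrences per attribute, so that inserts and deletes touch only this index and never rescan the data.

```python
from collections import defaultdict


class AttributeClusterIndex:
    """Hypothetical index: value -> {attribute: occurrence count}.

    The key set of each inner dict is the "cluster" of attributes
    that currently contain that value.
    """

    def __init__(self, attributes):
        self.attributes = set(attributes)
        self.counts = defaultdict(lambda: defaultdict(int))

    def insert(self, row):
        # row maps attribute name -> value; None is treated as null and skipped.
        for attr, value in row.items():
            if value is not None:
                self.counts[value][attr] += 1

    def delete(self, row):
        # Assumes the row was inserted before (no validation in this sketch).
        for attr, value in row.items():
            if value is None:
                continue
            per_attr = self.counts[value]
            per_attr[attr] -= 1
            if per_attr[attr] == 0:
                del per_attr[attr]      # attribute leaves the value's cluster
            if not per_attr:
                del self.counts[value]  # value no longer occurs anywhere

    def unary_inds(self):
        # A is included in B iff B appears in every cluster that contains A,
        # i.e. B is in the intersection of all clusters containing A.
        refs = {a: None for a in self.attributes}
        for per_attr in self.counts.values():
            cluster = set(per_attr)
            for a in cluster:
                refs[a] = cluster - {a} if refs[a] is None else refs[a] & cluster
        # Attributes without any value (refs[a] is None) are skipped here.
        return {(a, b) for a, bs in refs.items() if bs for b in bs}


idx = AttributeClusterIndex(["R.a", "S.b"])
for row in ({"S.b": 1}, {"S.b": 2}, {"R.a": 1}):
    idx.insert(row)
print(idx.unary_inds())  # R.a is included in S.b

idx.insert({"R.a": 3})   # 3 does not occur in S.b, so the dependency breaks
print(idx.unary_inds())
```

In this sketch the dependencies are recomputed by scanning the clusters, not the dataset; maintaining the set of dependencies itself under each update, as the abstract describes, would need additional bookkeeping that is left out here.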









Cite this article
Shaabani, N., Meinel, C. Incrementally updating unary inclusion dependencies in dynamic data. Distrib Parallel Databases 37, 133–176 (2019). https://doi.org/10.1007/s10619-018-7233-5
DOI: https://doi.org/10.1007/s10619-018-7233-5