Comparable dependencies over heterogeneous data

  • Regular Paper
  • The VLDB Journal

Abstract

To study data dependencies over heterogeneous data in dataspaces, we define a general dependency form, comparable dependencies (CDs), which specify constraints on comparable attributes. CDs cover the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in practical dataspace applications such as semantic query optimization. Because data in dataspaces are heterogeneous, the first question, known as the validation problem, is whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a given error or confidence guarantee is generally hard. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximate computation, including greedy and randomized approaches with an approximation bound in terms of the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.
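
To make the validation problem concrete, the following is a minimal sketch in Python, not the authors' algorithm. It assumes a toy similarity test, hypothetical attribute names (`addr`/`post`, `region`/`city`) that are treated as comparable across sources, and a generic "drop the object with the most violations" heuristic. The helper names `value`, `similar`, `violations`, and `greedy_keep` are illustrative only, and no approximation bound is claimed for this heuristic.

```python
from itertools import combinations


def value(obj, names):
    """Return the first value found under any of the comparable attribute names."""
    for name in names:
        if name in obj:
            return obj[name]
    return None


def similar(a, b, max_diff):
    """Toy comparison (an assumption): numbers within max_diff,
    otherwise case-insensitive string equality."""
    if a is None or b is None:
        return False
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= max_diff
    return str(a).strip().lower() == str(b).strip().lower()


def violations(objects, lhs, rhs):
    """Index pairs (i, j) that are similar on lhs but not on rhs."""
    out = []
    for i, j in combinations(range(len(objects)), 2):
        a, b = objects[i], objects[j]
        lhs_ok = similar(value(a, lhs[0]), value(b, lhs[0]), lhs[1])
        rhs_ok = similar(value(a, rhs[0]), value(b, rhs[0]), rhs[1])
        if lhs_ok and not rhs_ok:
            out.append((i, j))
    return out


def greedy_keep(n, pairs):
    """Greedily drop the object involved in the most remaining violations
    until no violating pair is left; return the indices that were kept."""
    kept = set(range(n))
    remaining = set(pairs)
    while remaining:
        degree = {}
        for i, j in remaining:
            degree[i] = degree.get(i, 0) + 1
            degree[j] = degree.get(j, 0) + 1
        # Ties are broken by the smallest index so the result is deterministic.
        worst = max(sorted(degree), key=degree.get)
        kept.discard(worst)
        remaining = {p for p in remaining if worst not in p}
    return kept


# Hypothetical objects from two sources with different attribute names.
objects = [
    {"addr": "5 Main St", "region": "Boston"},
    {"post": "5 main st", "city": "boston"},
    {"addr": "5 Main St", "city": "Chicago"},
    {"post": "9 Oak Ave", "region": "Denver"},
]
lhs = (("addr", "post"), 0)    # comparable address attributes, threshold 0
rhs = (("region", "city"), 0)  # comparable region attributes, threshold 0

pairs = violations(objects, lhs, rhs)
kept = greedy_keep(len(objects), pairs)
estimate = len(kept) / len(objects)  # fraction of objects retained
```

In this toy instance, object 2 conflicts with objects 0 and 1, so removing it leaves a subset on which the dependency holds. The retained fraction is only a heuristic lower bound on how much of the instance can satisfy the dependency; the hardness results stated in the abstract are the reason an exact answer is not attempted here.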

Author information

Corresponding author

Correspondence to Lei Chen.

About this article

Cite this article

Song, S., Chen, L. & Yu, P.S. Comparable dependencies over heterogeneous data. The VLDB Journal 22, 253–274 (2013). https://doi.org/10.1007/s00778-012-0285-7

  • DOI: https://doi.org/10.1007/s00778-012-0285-7
