The address connector: noninvasive synchronization of hierarchical data sources

  • Regular Paper
Knowledge and Information Systems

Abstract

Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.
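The two ideas named in the abstract, global greedy matching and address containment, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the absolute-difference distance on house numbers, and the tuple representation of addresses are all assumptions made for illustration. The sketch repeatedly matches the globally closest pair of still-unmatched items, which yields a matching with no blocking pair when preferences are derived from a symmetric distance, and it treats a coarser address as containing a finer one when the coarser address is a prefix of the finer one.

```python
from itertools import product


def greedy_stable_matching(left, right, dist, max_dist=None):
    """Global greedy matching: always take the closest still-unmatched pair.

    `left` and `right` hold distinct, hashable, sortable items (here: house
    numbers); `dist` is a symmetric distance (an assumption of this sketch).
    """
    # Sort all candidate pairs by distance; ties are broken by the pair itself.
    pairs = sorted(product(left, right), key=lambda p: (dist(*p), p))
    used_left, used_right, matching = set(), set(), []
    for a, b in pairs:
        if max_dist is not None and dist(a, b) > max_dist:
            break  # every remaining pair is at least this far apart
        if a not in used_left and b not in used_right:
            matching.append((a, b))
            used_left.add(a)
            used_right.add(b)
    return matching


def is_stable(matching, left, right, dist):
    """True if no pair (a, b) is strictly closer than both current partners."""
    inf = float("inf")
    partner_dist = {}
    for a, b in matching:
        partner_dist[("L", a)] = partner_dist[("R", b)] = dist(a, b)
    for a, b in product(left, right):
        d = dist(a, b)
        if d < partner_dist.get(("L", a), inf) and d < partner_dist.get(("R", b), inf):
            return False  # (a, b) is a blocking pair
    return True


def contains(coarse, fine):
    """Hypothetical containment test: `coarse` is a prefix of `fine`.

    Addresses are modeled as tuples ordered from coarse to fine,
    e.g. (street, house number, entrance).
    """
    return len(coarse) <= len(fine) and tuple(fine[:len(coarse)]) == tuple(coarse)


def house_number_distance(a, b):
    """Illustrative distance between two house numbers (an assumption)."""
    return abs(a - b)


source_numbers = [1, 3, 7]
target_numbers = [2, 4, 10]
result = greedy_stable_matching(source_numbers, target_numbers, house_number_distance)
print(result)  # [(1, 2), (3, 4), (7, 10)]
print(is_stable(result, source_numbers, target_numbers, house_number_distance))  # True
print(contains(("Main Street", "5"), ("Main Street", "5", "A")))  # True
```

In this sketch the optional `max_dist` threshold leaves distant items unmatched instead of forcing a pairing, and `is_stable` is meant only for checking a complete matching produced without that threshold.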

Acknowledgments

This work was partially funded by the SyRA (Synchronizing Residential Addresses) project of the Free University of Bozen-Bolzano, Italy.

Author information

Corresponding author

Correspondence to Nikolaus Augsten.

About this article

Cite this article

Augsten, N., Böhlen, M. & Gamper, J. The address connector: noninvasive synchronization of hierarchical data sources. Knowl Inf Syst 37, 639–663 (2013). https://doi.org/10.1007/s10115-012-0582-x

  • DOI: https://doi.org/10.1007/s10115-012-0582-x
