Abstract
Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to that set. Often, however, no reference set is available, and correcting addresses stored at different levels of granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored at different levels of granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment, which allows us to correctly link addresses of different granularity. An evaluation on real-world data from a municipality shows that our solution is both effective and efficient.
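The idea of a global greedy matching that yields a stable result can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the use of bare house numbers, and the distance `abs(a - b)` are assumptions made purely for illustration. The sketch sorts all candidate pairs by distance, repeatedly accepts the closest pair whose two items are still unmatched, and then checks that no unmatched pair would strictly prefer each other over their assigned partners.

```python
from itertools import product

def greedy_stable_matching(left, right, dist):
    """Match items of `left` to items of `right` one-to-one, closest pairs first."""
    # Enumerate every candidate pair and sort globally by distance;
    # ties are broken by the pair itself so the result is deterministic.
    pairs = sorted(product(left, right), key=lambda p: (dist(*p), p))
    used_left, used_right, matching = set(), set(), []
    for a, b in pairs:
        if a not in used_left and b not in used_right:
            matching.append((a, b))
            used_left.add(a)
            used_right.add(b)
    return matching

def is_stable(matching, left, right, dist):
    """True if no pair (a, b) is strictly closer than both current partners."""
    partner_of_left = dict(matching)
    partner_of_right = {b: a for a, b in matching}
    inf = float("inf")
    for a, b in product(left, right):
        if partner_of_left.get(a) == b:
            continue  # already matched to each other
        d = dist(a, b)
        a_current = dist(a, partner_of_left[a]) if a in partner_of_left else inf
        b_current = dist(partner_of_right[b], b) if b in partner_of_right else inf
        if d < a_current and d < b_current:
            return False  # (a, b) would be a blocking pair
    return True

# Hypothetical house numbers of one street in two databases.
house_numbers_a = [1, 3, 5, 12]
house_numbers_b = [2, 4, 11]

def number_distance(a, b):
    # Assumed toy distance; real addresses would need a richer comparison.
    return abs(a - b)

matching = greedy_stable_matching(house_numbers_a, house_numbers_b, number_distance)
stable = is_stable(matching, house_numbers_a, house_numbers_b, number_distance)
```

In this toy run, house number 5 stays unmatched because every number in the second list already has a partner that is at least as close. With a symmetric distance, accepting pairs in globally increasing order of distance cannot leave a blocking pair, which is the stability property checked by `is_stable`.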
Acknowledgments
This work was partially funded by the SyRA (Synchronizing Residential Addresses) project of the Free University of Bozen-Bolzano, Italy.
Cite this article
Augsten, N., Böhlen, M. & Gamper, J. The address connector: noninvasive synchronization of hierarchical data sources. Knowl Inf Syst 37, 639–663 (2013). https://doi.org/10.1007/s10115-012-0582-x
DOI: https://doi.org/10.1007/s10115-012-0582-x