The address connector: noninvasive synchronization of hierarchical data sources

Augsten, Nikolaus; Böhlen, Michael; Gamper, Johann

doi:10.1007/s10115-012-0582-x

Regular Paper
Published: 11 November 2012

Volume 37, pages 639–663, (2013)

Knowledge and Information Systems

Abstract

Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.
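
The abstract names a global greedy address-matching algorithm with a stable-matching guarantee but does not spell it out on this page. As a rough illustration only, the following Python sketch shows a generic global greedy matching: all candidate pairs are ranked by distance, and the closest pair whose two members are both still free is accepted first. The function names, the use of integer house numbers, and the absolute-difference distance are assumptions made for this example and are not taken from the paper.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[int, int]


def global_greedy_matching(
    left: Sequence[int],
    right: Sequence[int],
    distance: Callable[[int, int], float],
) -> List[Pair]:
    """Pair items of `left` with items of `right`, globally closest pairs first.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Rank every candidate pair by distance; the indices break ties
    # deterministically.
    candidates = sorted(
        (distance(a, b), i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
    )
    used_left, used_right = set(), set()
    matching: List[Pair] = []
    for _, i, j in candidates:
        # Accept a pair only if both of its members are still unmatched.
        if i not in used_left and j not in used_right:
            used_left.add(i)
            used_right.add(j)
            matching.append((left[i], right[j]))
    return matching


def is_stable(matching: List[Pair], distance: Callable[[int, int], float]) -> bool:
    """Check that no blocking pair exists among the matched items.

    A blocking pair (a, d) is one where a is closer to d than to its own
    partner and d is closer to a than to its own partner. Unmatched items
    are not considered in this simplified check.
    """
    for a, b in matching:
        for c, d in matching:
            if distance(a, d) < distance(a, b) and distance(a, d) < distance(c, d):
                return False
    return True


# Assumed toy data: house numbers of one street as stored in two databases.
numbers_a = [1, 3, 5]
numbers_b = [2, 3, 7]
gap = lambda a, b: abs(a - b)

result = global_greedy_matching(numbers_a, numbers_b, gap)
print(result)                  # [(3, 3), (1, 2), (5, 7)]
print(is_stable(result, gap))  # True
```

Because the preferences of both sides derive from the same symmetric distance, accepting pairs in global distance order leaves no two items that would both rather be matched to each other, which is the intuition behind the stability property the abstract mentions.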

Acknowledgments

This work was partially funded by the SyRA (Synchronizing Residential Addresses) project of the Free University of Bozen-Bolzano, Italy.

Author information

Authors and Affiliations

Faculty of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
Nikolaus Augsten & Johann Gamper
Department of Informatics, University of Zurich, Zurich, Switzerland
Michael Böhlen

Corresponding author

Correspondence to Nikolaus Augsten.

About this article

Cite this article

Augsten, N., Böhlen, M. & Gamper, J. The address connector: noninvasive synchronization of hierarchical data sources. Knowl Inf Syst 37, 639–663 (2013). https://doi.org/10.1007/s10115-012-0582-x

Received: 18 November 2011
Revised: 30 July 2012
Accepted: 18 October 2012
Published: 11 November 2012
Issue Date: December 2013
DOI: https://doi.org/10.1007/s10115-012-0582-x
