DOI: 10.1145/3678717.3691290
short-paper
Open access

Address De-duplication using Iterative k-Core Graph Decomposition

Published: 22 November 2024

Abstract

A de-duplicated and complete address catalog is essential for any application or business that manages large volumes of address data, such as delivery logistics, first-responder services, and government databases. For catalog creation, address data is usually procured from disparate sources, which often vary in quality and coverage and introduce duplicates or variations of the same physical address. Address de-duplication is therefore a crucial step in creating a clean and unified address catalog. De-duplication is even more challenging at a global scale because address writing styles are diverse: some regions lack standardized addressing systems, and addresses can be multilingual. In this paper, we formulate address de-duplication as an unsupervised graph clustering problem and propose SANGAM, a novel adaptation of the k-core graph decomposition algorithm. We evaluate this solution on diverse geographic regions around the world. In comparison to existing methods, we observe improvements on the F-beta measure for three datasets. Our key contributions are: (1) formulating address de-duplication as a graph clustering problem, (2) proposing SANGAM, a robust and generic de-duplication approach, (3) validating its effectiveness on diverse geographies across three continents (the Americas, Africa, and Europe), and (4) deploying our solution and showing its positive impact on geocode learning, an essential application of address de-duplication.
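
To make the graph formulation concrete, the following is a minimal Python sketch of clustering with k-core decomposition. It is illustrative only: the abstract does not specify how SANGAM adapts the algorithm, so the iteration scheme here (extract the densest core, emit its connected components as duplicate groups, remove them, repeat) is an assumption, as are the function names and the toy edge list. Nodes stand for address records, and an edge is assumed to mean that some upstream matcher judged two records similar.

```python
def core_numbers(adj):
    """Core number of every node, found by peeling minimum-degree nodes.

    adj maps each node to the set of its neighbours (undirected graph).
    This simple version is quadratic; it is a sketch, not an O(m) routine.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])          # core level never decreases
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def connected_components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.add(v)
            for u in adj[v]:
                if u in nodes and u not in seen:
                    seen.add(u)
                    stack.append(u)
        components.append(component)
    return components


def iterative_kcore_clusters(edges, nodes):
    """Hypothetical iterative scheme (an assumption, not the paper's method)."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    active, clusters = set(adj), []
    while active:
        # Restrict the graph to records not yet assigned to a cluster.
        sub = {v: adj[v] & active for v in active}
        core = core_numbers(sub)
        k_max = max(core.values())
        densest = [v for v in active if core[v] == k_max]
        for component in connected_components(densest, sub):
            clusters.append(component)
            active -= component
    return clusters


# Toy similarity graph: records 0-2 are mutually similar, 3-4 are similar,
# 2-3 is a weak bridge, and record 5 matches nothing.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
clusters = iterative_kcore_clusters(edges, range(6))
print([sorted(c) for c in clusters])  # [[0, 1, 2], [3, 4], [5]]
```

In this toy graph the bridge between records 2 and 3 does not merge the two groups, because the triangle is peeled off as the densest core before the remaining records are considered. A single pass of connected components over the same graph would instead place records 0 through 4 in one cluster.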

Published In

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
October 2024
743 pages
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Address De-duplication
  2. Clustering
  3. Graph Decomposition

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

SIGSPATIAL '24

Acceptance Rates

SIGSPATIAL '24 Paper Acceptance Rate 37 of 122 submissions, 30%;
Overall Acceptance Rate 257 of 1,238 submissions, 21%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 25
  • Downloads (Last 12 months): 25
  • Downloads (Last 6 weeks): 18

Reflects downloads up to 14 Jan 2025
