Accurate Customer Address Matching via Weak Supervision for Geocode Learning

Authors:

Saket Maheshwary,

Saurabh Sohoney

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems

Pages 454 - 464

https://doi.org/10.1145/3678717.3691277

Published: 22 November 2024

Abstract

Determining the precise location of customers is important for an efficient and reliable delivery experience, both for customers and for delivery associates. Address text is a primary source of information that customers provide about their location. In this paper, we study the important and challenging task of matching free-form customer address text to determine whether two addresses represent the same physical building. We introduce a novel address matching framework that uses a transformer-based encoder, avoiding the tedious and time-consuming manual feature engineering that the baseline model requires. Furthermore, our proposed framework employs weak supervision to turn historical delivery information into high-quality labeled data. This reduces the need for the massive amounts of manually labeled data that transformer-based models typically require. Our experiments on manually curated datasets demonstrate that the approach is effective and generic: compared to the current baseline model, we achieve a 15.57% improvement in recall at 95% precision, on average, across four geographies. We also introduce delivery point (DP) geocode learning for cold-start addresses as a downstream application of customer address matching. In addition to offline experiments, we performed online A/B experiments for DP geocode learning with our proposed approach and observed that, compared to the baseline model, delivery precision improved by 8.09% and delivery defects fell by 11.78% on average across four geographies.
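The headline offline metric in the abstract is recall at 95% precision. As a minimal sketch of how such a metric is commonly computed for a pairwise matcher (the function name `recall_at_precision` and the toy labels and scores below are illustrative assumptions, not taken from the paper), one can sweep the decision threshold over the model's match scores and keep the best recall among thresholds that meet the precision floor:

```python
from typing import Sequence


def recall_at_precision(
    labels: Sequence[int],
    scores: Sequence[float],
    min_precision: float = 0.95,
) -> float:
    """Highest recall over all score thresholds whose precision is at
    least `min_precision`; 0.0 if no threshold qualifies."""
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    # Sort pairs by descending score; walking the list lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume every pair tied at this score so a threshold never splits a tie.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
        i = j
    return best_recall


# Hypothetical address pairs: label 1 = same building, 0 = different buildings.
toy_labels = [1, 1, 1, 0, 1, 0]
toy_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.10]
print(recall_at_precision(toy_labels, toy_scores))        # precision floor 0.95
print(recall_at_precision(toy_labels, toy_scores, 0.80))  # looser floor 0.80
```

Fixing precision and comparing recall suits this setting because a false match can attach the wrong building's geocode to an address, so the useful question is how many true matches a model recovers while false matches stay rare.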

Published In

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems

October 2024

743 pages

ISBN: 9798400711077

DOI: 10.1145/3678717

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Sponsors

SIGSPATIAL: ACM Special Interest Group on Spatial Information

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Industry Paper

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGSPATIAL '24

Sponsor:

SIGSPATIAL

SIGSPATIAL '24: The 32nd ACM International Conference on Advances in Geographic Information Systems

October 29 - November 1, 2024

Atlanta, GA, USA

Acceptance Rates

SIGSPATIAL '24 Paper Acceptance Rate 37 of 122 submissions, 30%;

Overall Acceptance Rate 257 of 1,238 submissions, 21%
