DOI: 10.1145/3678717.3691277
research-article
Open access

Accurate Customer Address Matching via Weak Supervision for Geocode Learning

Published: 22 November 2024

Abstract

Determining the precise location of customers is important for an efficient and reliable delivery experience, both for customers and for delivery associates. Address text is a primary source of information that customers provide about their location. In this paper, we study the important and challenging task of matching free-form customer address text to determine whether two addresses represent the same physical building. We introduce a novel address matching framework that leverages a transformer-based encoder, avoiding the tedious and time-consuming manual feature engineering required by the baseline model. Furthermore, our proposed framework employs weak supervision to leverage historic delivery information and generate high-quality labeled data. This reduces the need for the massive amounts of labeled data that transformer-based models typically require. Our experiments on manually curated datasets demonstrate the effectiveness and generality of our approach: on average, we achieve a 15.57% improvement in recall at 95% precision over the current baseline model across four geographies. We also introduce delivery point (DP) geocode learning for cold-start addresses as a downstream application of customer address matching. In addition to offline experiments, we performed online A/B experiments for DP geocode learning with our proposed approach and observed that, on average across four geographies, delivery precision improved by 8.09% and delivery defects fell by 11.78% compared with the baseline model.
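
The headline offline metric, recall at 95% precision, can be illustrated with a minimal sketch. The abstract does not describe the evaluation code, so everything below is an assumption made for illustration: the function name `recall_at_precision`, the convention that label 1 means "same building", and the toy scores. The sketch sweeps every possible score threshold and reports the highest recall among thresholds whose precision meets the target.

```python
from typing import Sequence


def recall_at_precision(
    labels: Sequence[int],
    scores: Sequence[float],
    min_precision: float = 0.95,
) -> float:
    """Highest recall over all score thresholds with precision >= min_precision.

    labels: 1 if the address pair is a true match, 0 otherwise (assumed convention).
    scores: model match scores, higher meaning "more likely the same building".
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0

    # Visit pairs from the highest score down; each distinct score is a threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    false_pos = 0
    best_recall = 0.0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Consume every pair tied at this score so the threshold is well defined.
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        precision = true_pos / (true_pos + false_pos)
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_positives)
    return best_recall


# Toy data (hypothetical): three true matches, two non-matches.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
toy_recall = recall_at_precision(toy_labels, toy_scores, min_precision=0.95)
```

On the toy data, only the two highest-scoring pairs can be accepted before a false match drops precision below 95%, so the metric is 2/3. Comparing this value between a baseline and a candidate model on the same labeled pairs gives the kind of recall comparison the abstract reports.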

Published In

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
October 2024
743 pages
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Industry Paper

Author Tags

  1. Entity Matching
  2. Geocoding
  3. Language Models
  4. Weak Supervision

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGSPATIAL '24

Acceptance Rates

SIGSPATIAL '24 Paper Acceptance Rate 37 of 122 submissions, 30%;
Overall Acceptance Rate 257 of 1,238 submissions, 21%
