Using schema.org Annotations for Training and Maintaining Product Matchers

Published: 24 August 2020

Abstract

Product matching is a central task within e-commerce applications such as price comparison portals and online marketplaces. State-of-the-art product matching methods achieve F1 scores above 0.90 using deep learning techniques combined with large amounts of training data (e.g., more than 100K pairs of offers). Gathering and maintaining such large training corpora is costly, as it requires labeling pairs of offers as matches or non-matches. Building a strong product matching capability is thus a major investment for an e-commerce company. This paper shows that the manual labeling of training data for product matching can be replaced by relying exclusively on schema.org annotations gathered from the public Web. We show that, using only schema.org data for training, we are able to achieve F1 scores between 0.92 and 0.95, depending on the product category. As new products appear every day, it is important that matching models can be maintained with justifiable effort. To give practical advice on how to maintain matching models, we compare the performance of deep learning and traditional matching models on unseen products and experiment with different fine-tuning and re-training strategies for model maintenance, again using only schema.org annotations as training data. Finally, as using the public Web for distant supervision carries inherent noise, we evaluate deep learning and traditional matching models with regard to their resistance to label noise and show that deep learning is able to deal with the amount of identifier noise found in schema.org annotations.
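The distant-supervision idea in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration rather than a description of the paper's actual pipeline, that each offer carries a product identifier taken from a schema.org annotation (for example a `gtin13`, `mpn`, or `sku` value) and that offers sharing an identifier describe the same product. The function names `build_training_pairs` and `f1_score`, and the `identifier` and `title` keys, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_training_pairs(offers):
    """Derive labeled offer pairs from shared product identifiers.

    Assumption: each offer is a dict with a hypothetical "identifier" key
    (a value scraped from a schema.org annotation) and a "title" key.
    """
    clusters = defaultdict(list)
    for offer in offers:
        clusters[offer["identifier"]].append(offer["title"])

    pairs = []
    # Offers that share an identifier are labeled as matches (1).
    for titles in clusters.values():
        for left, right in combinations(titles, 2):
            pairs.append((left, right, 1))
    # Offers from different identifier clusters are labeled as non-matches (0).
    for id_a, id_b in combinations(sorted(clusters), 2):
        for left in clusters[id_a]:
            for right in clusters[id_b]:
                pairs.append((left, right, 0))
    return pairs


def f1_score(gold, predicted):
    """Compute F1 for the match class (label 1) from two label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy offers; the identifiers and titles are made up for illustration.
offers = [
    {"identifier": "A", "title": "Acme Phone X 64GB black"},
    {"identifier": "A", "title": "ACME PhoneX (64 GB), black"},
    {"identifier": "B", "title": "Acme Phone Y 128GB"},
]
pairs = build_training_pairs(offers)
```

Because identifiers on the Web can be wrong or reused, some of the labels produced this way will be incorrect. This is the identifier noise that the abstract refers to when it evaluates label-noise resistance. In practice, non-matching pairs would also need to be sampled rather than enumerated, since their number grows quadratically with the number of offers.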

Published In

WIMS 2020: Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics
June 2020
279 pages
ISBN: 9781450375429
DOI: 10.1145/3405962

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. distant supervision
  3. e-commerce
  4. product matching
  5. schema.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WIMS 2020

Acceptance Rates

WIMS 2020 Paper Acceptance Rate 35 of 63 submissions, 56%;
Overall Acceptance Rate 140 of 278 submissions, 50%

Article Metrics

  • Downloads (Last 12 months): 19
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 20 Feb 2025

Cited By

  • (2024) Cross-Lingual Learning Strategies for Improving Product Matching Quality. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, 313-320. DOI: 10.1145/3605098.3636001. Online publication date: 8-Apr-2024
  • (2023) Building Effective Features based on Automatic Learning for Smart Search. 2023 International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), 897-901. DOI: 10.1109/IDCIoT56793.2023.10053448. Online publication date: 5-Jan-2023
  • (2023) Using Machine Learning and NLP for the Product Matching Problem. Intelligent Sustainable Systems, 439-448. DOI: 10.1007/978-981-19-7663-6_41. Online publication date: 25-Jan-2023
  • (2022) An Entity-Matching System Based on Multimodal Data for Two Major E-Commerce Stores in Mexico. Mathematics 10:15 (2564). DOI: 10.3390/math10152564. Online publication date: 23-Jul-2022
  • (2022) Multilingual Transformers for Product Matching – Experiments and a New Benchmark in Polish. 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1-8. DOI: 10.1109/FUZZ-IEEE55066.2022.9882843. Online publication date: 18-Jul-2022
  • (2022) An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining. SN Computer Science 4:1. DOI: 10.1007/s42979-022-01415-3. Online publication date: 17-Oct-2022
  • (2021) Dual-objective fine-tuning of BERT for entity matching. Proceedings of the VLDB Endowment 14:10 (1913-1921). DOI: 10.14778/3467861.3467878. Online publication date: 1-Jun-2021
