The WDC Training Dataset and Gold Standard for Large-Scale Product Matching

Authors:

Christian Bizer

WWW '19: Companion Proceedings of The 2019 World Wide Web Conference

Pages 381 - 386

https://doi.org/10.1145/3308560.3316609

Published: 13 May 2019

Abstract

A current research question in entity resolution (also called link discovery or duplicate detection) is whether, and in which cases, embeddings and deep neural network-based matching methods outperform traditional symbolic matching methods. Answering this question is difficult because deep learning-based matchers need large amounts of training data, and the publicly available entity resolution benchmark datasets are too small to properly evaluate this new family of matching methods. The WDC Training Dataset for Large-Scale Product Matching fills this gap. The English-language subset of the training dataset consists of 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops that provide schema.org annotations including some form of product ID, such as a GTIN or MPN. We also created a gold standard by manually verifying 2,200 pairs of offers belonging to four product categories. Using a subset of our training dataset together with this gold standard, we are able to publicly replicate the recent result of Mudgal et al. that embeddings and deep neural network-based matching methods outperform traditional symbolic matching methods on less structured data.
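
The abstract says the training pairs come from offers that carry a product ID such as a GTIN or MPN. The following is a minimal sketch of how such identifiers could be used to derive pairs of offers referring to the same product. It is an illustration only, not the authors' pipeline: the record layout, the field names `offer_id`, `gtin`, and `mpn`, and the normalization step are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def matching_pairs(offers):
    """Return sorted pairs of offer IDs that share a product identifier.

    Each offer is a dict with a hypothetical "offer_id" key and optional
    "gtin" / "mpn" keys (assumed field names, not taken from the paper).
    """
    clusters = defaultdict(set)
    for offer in offers:
        for id_type in ("gtin", "mpn"):
            value = offer.get(id_type)
            if value:
                # Keep GTINs and MPNs in separate namespaces and apply a
                # simple (assumed) normalization before comparing values.
                key = (id_type, value.strip().lower())
                clusters[key].add(offer["offer_id"])

    pairs = set()
    for offer_ids in clusters.values():
        # Every two offers inside one identifier cluster form a pair.
        pairs.update(combinations(sorted(offer_ids), 2))
    return sorted(pairs)


# Made-up example records; the identifier values are placeholders.
EXAMPLE_OFFERS = [
    {"offer_id": "shop-a/1", "gtin": "0001"},
    {"offer_id": "shop-b/7", "gtin": "0001"},
    {"offer_id": "shop-c/3", "gtin": "0001", "mpn": "XJ-9"},
    {"offer_id": "shop-d/5", "mpn": "xj-9 "},
    {"offer_id": "shop-e/2"},  # no identifier, so it joins no cluster
]

example_pairs = matching_pairs(EXAMPLE_OFFERS)
for left, right in example_pairs:
    print(left, "<->", right)
```

The sketch only pairs offers that share an identifier directly; it does not merge clusters transitively, and it makes no attempt to detect wrongly annotated identifiers, both of which a real pipeline would likely need to consider.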

References

[1] Manel Achichi, Michelle Cheatham, et al. 2017. Results of the Ontology Alignment Evaluation Initiative 2017. In Proceedings of OM 2017 - 12th ISWC workshop on ontology matching. 61–113.

[2] Sanjib Das, AnHai Doan, et al. 2016. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/useful-stuff/data.

[3] Sanjib Das, Paul Suganthan G.C., et al. 2017. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). 1431–1446.

[4] Evangelia Daskalaki, Giorgos Flouris, et al. 2016. Instance Matching Benchmarks in the Era of Linked Data. Journal of Web Semantics 39 (2016), 1–14.

[5] Chaitanya Gokhale, Sanjib Das, et al. 2014. Corleone: Hands-off Crowdsourcing for Entity Matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14). 601–612.

[6] Anitha Kannan, Inmar E. Givoni, et al. 2011. Matching Unstructured Product Offers to Structured Product Specifications. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’11). 404–412.

[7] Elias Kärle, Anna Fensel, et al. 2016. Why Are There More Hotels in Tyrol than in Austria? Analyzing Schema.org Usage in the Hotel Domain. In Information and Communication Technologies in Tourism 2016. Cham, 99–112.

[8] Pradap Konda, Jeff Naughton, et al. 2016. Magellan: toward building entity matching management systems. Proceedings of the VLDB Endowment 9, 12 (2016), 1197–1208.

[9] Hanna Köpcke and Erhard Rahm. 2008. Training selection for tuning entity matching. In Proceedings of the 6th International Workshop on Quality in Databases and Management of Uncertain Data (QDB/MUD ’08). 3–12.

[10] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-world Match Problems. Proceedings of the VLDB Endowment 3, 1-2 (2010), 484–493.

[11] Robert Meusel and Heiko Paulheim. 2015. Heuristics for Fixing Common Errors in Deployed Schema.Org Microdata. In Proceedings of the 12th European Semantic Web Conference on The Semantic Web. Latest Advances and New Domains - Volume 9088. 152–168.

[12] Sidharth Mudgal, Han Li, et al. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). 19–34.

[13] Petar Petrovski and Christian Bizer. 2017. Extracting Attribute-value Pairs from Product Specifications on the Web. In Proceedings of the International Conference on Web Intelligence (WI ’17). 558–565.

[14] Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Integrating Product Data from Websites Offering Microdata Markup. In Companion Proceedings of the 23rd International Conference on World Wide Web (WWW ’14 Companion). 1299–1304.

[15] Petar Petrovski, Anna Primpeli, et al. 2017. The WDC Gold Standards for Product Feature Extraction and Product Matching. In E-Commerce and Web Technologies. 73–86.

[16] Disheng Qiu, Luciano Barbosa, et al. 2015. Dexter: Large-scale Discovery and Extraction of Product Specifications on the Web. Proceedings of the VLDB Endowment 8, 13 (2015), 2194–2205.

[17] Petar Ristoski, Petar Petrovski, et al. 2018. A machine learning approach for product matching and categorization. Semantic Web 9, 5 (2018), 707–728.

[18] Kashif Shah, Selcuk Kopru, et al. 2018. Neural Network based Extreme Classification and Similarity Models for Product Matching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). 8–15.

Index Terms

The WDC Training Dataset and Gold Standard for Large-Scale Product Matching
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

WWW '19: Companion Proceedings of The 2019 World Wide Web Conference

May 2019

1331 pages

ISBN:9781450366755

DOI:10.1145/3308560

Editors: Ling Liu (Georgia Tech, USA), Ryen White (Microsoft Research, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '19

WWW '19: The Web Conference

May 13 - 17, 2019

San Francisco, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 46
Total Downloads: 659
Downloads (Last 12 months): 61
Downloads (Last 6 weeks): 4

Reflects downloads up to 20 Feb 2025

Cited By

Zhang Z, Zeng W, Tang J, Huang H, Zhao X (2025). Active in-context learning for cross-domain entity resolution. Information Fusion 117:C. Online publication date: 1-May-2025. https://dl.acm.org/doi/10.1016/j.inffus.2024.102816

Nananukul N, Kekriwal M (2025). Balancing Efficiency and Quality in LLM-Based Entity Resolution on Structured Data. Social Networks Analysis and Mining (278-293). Online publication date: 24-Jan-2025. https://doi.org/10.1007/978-3-031-78548-1_21

Cao H, Du S, Hu J, Yang Y, Horng S, Li T (2024). Graph Deep Active Learning Framework for Data Deduplication. Big Data Mining and Analytics 7:3 (753-764). Online publication date: Sep-2024. https://doi.org/10.26599/BDMA.2023.9020040

Shahbazi N, Erfanian M, Asudeh A, Nargesian F, Srivastava D (2024). FairEM360: A Suite for Responsible Entity Matching. Proceedings of the VLDB Endowment 17:12 (4417-4420). Online publication date: 8-Nov-2024. https://doi.org/10.14778/3685800.3685889

Huang W, Melo A, Pan J, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). A Large-scale Offer Alignment Model for Partitioning Filtering and Matching Product Offers. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2880-2884). Online publication date: 10-Jul-2024. https://dl.acm.org/doi/10.1145/3626772.3661351

Alves A, Baptista C, Barbosa L, Araujo C, Hong J, Park J (2024). Cross-Lingual Learning Strategies for Improving Product Matching Quality. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (313-320). Online publication date: 8-Apr-2024. https://dl.acm.org/doi/10.1145/3605098.3636001

Jiang S, Lan Y, Wang W, Guo Z (2024). Pyramid: A Heterogeneous Data Integration Algorithm Based on Hierarchical Graph. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (6220-6224). Online publication date: 14-Apr-2024. https://doi.org/10.1109/ICASSP48485.2024.10447879

Yuedanni (2024). Adaptive Target-Consistency Entity Matching Algorithm Based on Semi-Supervised Learning. 2024 10th International Conference on Big Data and Information Analytics (BigDIA) (31-37). Online publication date: 25-Oct-2024. https://doi.org/10.1109/BigDIA63733.2024.10808744

Papa M, Chatzigiannakis I, Anagnostopoulos A (2024). Automated Natural Language Processing-Based Supplier Discovery for Financial Services. Big Data 12:1 (30-48). Online publication date: 1-Feb-2024. https://doi.org/10.1089/big.2022.0215

Xu Z, Wang N (2024). Low-resource entity resolution with domain generalization and active learning. Neurocomputing 599 (128131). Online publication date: Sep-2024. https://doi.org/10.1016/j.neucom.2024.128131
