DOI: 10.1145/3308560.3316609
research-article

The WDC Training Dataset and Gold Standard for Large-Scale Product Matching

Published: 13 May 2019

Abstract

A current research question in the area of entity resolution (also called link discovery or duplicate detection) is whether and in which cases embeddings and deep neural network based matching methods outperform traditional symbolic matching methods. The problem with answering this question is that deep learning based matchers need large amounts of training data. The entity resolution benchmark datasets that are currently available to the public are too small to properly evaluate this new family of matching methods. The WDC Training Dataset for Large-Scale Product Matching fills this gap. The English language subset of the training dataset consists of 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops which provide schema.org annotations including some form of product ID such as a GTIN or MPN. We also created a gold standard by manually verifying 2200 pairs of offers belonging to four product categories. Using a subset of our training dataset together with this gold standard, we are able to publicly replicate the recent result of Mudgal et al. that embeddings and deep neural network based matching methods outperform traditional symbolic matching methods on less structured data.
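
The abstract states that training pairs come from offers annotated with a product ID such as a GTIN or MPN, but it does not describe the grouping procedure itself. The following is a minimal sketch of one way such pairs could be derived, assuming each offer is a dictionary with hypothetical keys `offer_id`, `gtin`, and `mpn`. The normalization rule and all function names are illustrative assumptions, not the authors' pipeline.

```python
from collections import defaultdict
from itertools import combinations


def normalize_id(value):
    """Uppercase the identifier and drop non-alphanumeric characters.

    This cleaning rule is an assumption made for illustration only.
    """
    return "".join(ch for ch in value.upper() if ch.isalnum())


def build_positive_pairs(offers):
    """Return sorted pairs of offer ids that share a normalized GTIN or MPN.

    `offers` is a list of dicts with the hypothetical keys
    'offer_id', 'gtin', and 'mpn' (the latter two are optional).
    """
    clusters = defaultdict(set)
    for offer in offers:
        for key in ("gtin", "mpn"):
            value = offer.get(key)
            if value:
                clusters[(key, normalize_id(value))].add(offer["offer_id"])

    pairs = set()
    for offer_ids in clusters.values():
        # Every two offers in the same identifier cluster form one pair.
        for left, right in combinations(sorted(offer_ids), 2):
            pairs.add((left, right))
    return sorted(pairs)


if __name__ == "__main__":
    example_offers = [
        {"offer_id": "a", "gtin": "00123 45678905"},
        {"offer_id": "b", "gtin": "0012345678905", "mpn": "ab-100"},
        {"offer_id": "c", "mpn": "AB100"},
        {"offer_id": "d"},
    ]
    print(build_positive_pairs(example_offers))
```

Note that this sketch pairs only offers that share an identifier directly. It does not take a transitive closure, so offers `a` and `c` in the example are not paired even though both match `b`. Whether to merge such clusters is a design choice that depends on how trustworthy the annotated identifiers are.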

References

[1]
Manel Achichi, Michelle Cheatham, et al. 2017. Results of the Ontology Alignment Evaluation Initiative 2017. In Proceedings of OM 2017-12th ISWC workshop on ontology matching. 61–113.
[2]
Sanjib Das, AnHai Doan, et al. 2016. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/useful-stuff/data.
[3]
Sanjib Das, Paul Suganthan G.C., et al. 2017. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). 1431–1446.
[4]
Evangelia Daskalaki, Giorgos Flouris, et al. 2016. Instance Matching Benchmarks in the Era of Linked Data. Journal of Web Semantics 39 (2016), 1–14.
[5]
Chaitanya Gokhale, Sanjib Das, et al. 2014. Corleone: Hands-off Crowdsourcing for Entity Matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14). 601–612.
[6]
Anitha Kannan, Inmar E. Givoni, et al. 2011. Matching Unstructured Product Offers to Structured Product Specifications. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’11). 404–412.
[7]
Elias Kärle, Anna Fensel, et al. 2016. Why Are There More Hotels in Tyrol than in Austria? Analyzing Schema.org Usage in the Hotel Domain. In Information and Communication Technologies in Tourism 2016. Cham, 99–112.
[8]
Pradap Konda, Jeff Naughton, et al. 2016. Magellan: toward building entity matching management systems. Proceedings of the VLDB Endowment 9, 12 (2016), 1197–1208.
[9]
Hanna Köpcke and Erhard Rahm. 2008. Training selection for tuning entity matching. In Proceedings of the 6th International Workshop on Quality in Databases and Management of Uncertain Data (QDB/MUD ’08). 3–12.
[10]
Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-world Match Problems. Proceedings of the VLDB Endowment 3, 1-2 (2010), 484–493.
[11]
Robert Meusel and Heiko Paulheim. 2015. Heuristics for Fixing Common Errors in Deployed Schema.Org Microdata. In Proceedings of the 12th European Semantic Web Conference on The Semantic Web. Latest Advances and New Domains - Volume 9088. 152–168.
[12]
Sidharth Mudgal, Han Li, et al. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). 19–34.
[13]
Petar Petrovski and Christian Bizer. 2017. Extracting Attribute-value Pairs from Product Specifications on the Web. In Proceedings of the International Conference on Web Intelligence (WI ’17). 558–565.
[14]
Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Integrating Product Data from Websites Offering Microdata Markup. In Companion Proceedings of the 23rd International Conference on World Wide Web (WWW ’14 Companion). 1299–1304.
[15]
Petar Petrovski, Anna Primpeli, et al. 2017. The WDC Gold Standards for Product Feature Extraction and Product Matching. In E-Commerce and Web Technologies. 73–86.
[16]
Disheng Qiu, Luciano Barbosa, et al. 2015. Dexter: Large-scale Discovery and Extraction of Product Specifications on the Web. Proceedings of the VLDB Endowment 8, 13 (2015), 2194–2205.
[17]
Petar Ristoski, Petar Petrovski, et al. 2018. A machine learning approach for product matching and categorization. Semantic Web 9, 5 (2018), 707–728.
[18]
Kashif Shah, Selcuk Kopru, et al. 2018. Neural Network based Extreme Classification and Similarity Models for Product Matching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). 8–15.

      Published In

      WWW '19: Companion Proceedings of The 2019 World Wide Web Conference
      May 2019
      1331 pages
      ISBN:9781450366755
      DOI:10.1145/3308560
      In-Cooperation

      • IW3C2: International World Wide Web Conference Committee

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. deep matching
      2. embeddings
      3. entity resolution
      4. evaluation data
      5. product matching
      6. schema.org annotations

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      WWW '19
      WWW '19: The Web Conference
      May 13 - 17, 2019
      San Francisco, USA

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

