Catalog Integration of Heterogeneous and Volatile Product Data

  • Conference paper
Data Management Technologies and Applications (DATA 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1446)

Abstract

Integrating frequently changing, volatile product data from different manufacturers into a single catalog is a significant challenge for small and medium-sized e-commerce companies. These companies depend on integrating product data promptly so that they can present it in aggregated form in an online shop, yet they do not know the manufacturers' format specifications, their understanding of concepts, or the quality of their data. Furthermore, formats, concepts, and data quality may change at any time. Consequently, integrating product catalogs into a single standardized catalog is often a laborious manual task. Current strategies to streamline or automate catalog integration use techniques based on machine learning, word vectorization, or semantic similarity. However, most approaches struggle with low-quality or real-world data. We propose Attribute Label Ranking (ALR) as a recommendation engine that simplifies, for practitioners, the integration of previously unknown, proprietary tabular formats into a standardized catalog. We evaluate ALR by focusing on the impact of different neural network architectures, language features, and semantic similarity. Additionally, we consider metrics for industrial application and present the impact of ALR in production and its limitations.

This work was supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) within the Central Innovation Programme for SMEs (grant no. 16KN063729) and antibodies-online GmbH.

Notes

  1. https://github.com/oschmi/antibody-catalog-integration-dataset

Author information

Corresponding author

Correspondence to Oliver Schmidts.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Schmidts, O., Kraft, B., Winkens, M., Zündorf, A. (2021). Catalog Integration of Heterogeneous and Volatile Product Data. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_7

  • DOI: https://doi.org/10.1007/978-3-030-83014-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83013-7

  • Online ISBN: 978-3-030-83014-4

  • eBook Packages: Computer Science, Computer Science (R0)
